jmlr jmlr2012 jmlr2012-46 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos
Abstract: In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm. Keywords: Markov decision processes, reinforcement learning, least-squares temporal-difference, least-squares policy iteration, generalization bounds, finite-sample analysis
Reference: text
sentIndex sentText sentNum sentScore
1 INRIA Lille - Nord Europe, Team SequeL, 40 Avenue Halley, 59650 Villeneuve d’Ascq, France. Editor: Ronald Parr. Abstract: In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. [sent-7, score-0.288]
2 We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report finite-sample analysis for this algorithm. [sent-8, score-0.213]
3 To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. [sent-9, score-0.235]
4 This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. [sent-10, score-0.291]
5 We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. [sent-11, score-0.385]
6 Finally, we analyze how the error at each policy evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm. [sent-12, score-0.555]
7 Keywords: Markov decision processes, reinforcement learning, least-squares temporal-difference, least-squares policy iteration, generalization bounds, finite-sample analysis 1. [sent-13, score-0.213]
8 Introduction Least-squares temporal-difference (LSTD) learning (Bradtke and Barto, 1996; Boyan, 1999) is a widely used algorithm for prediction in general, and in the context of reinforcement learning (RL), for learning the value function V π of a given policy π. [sent-14, score-0.213]
9 LSTD has been successfully applied to a number of problems especially after the development of the least-squares policy iteration (LSPI) algorithm (Lagoudakis and Parr, 2003), which extends LSTD to control by using it in the policy evaluation step of policy iteration. [sent-15, score-0.665]
10 In this bound, it is assumed that the Markov chain possesses a stationary distribution ρπ and the distances are measured according to ρπ . [sent-20, score-0.344]
11 Yu (2010) has extended this analysis and derived an asymptotic convergence analysis for off-policy LSTD(λ), that is, when the samples are collected by following a behavior policy different from the policy π under evaluation. [sent-22, score-0.488]
12 His analysis reveals a critical dependency on the inverse of the smallest eigenvalue of the LSTD’s A matrix (note that the LSTD solution is obtained by solving a system of linear equations Ax = b). [sent-24, score-0.193]
13 Our analysis is for a specific implementation of LSTD that we call pathwise LSTD. [sent-41, score-0.226]
14 Pathwise LSTD has two specific characteristics: 1) it takes a single trajectory generated by the Markov chain induced by policy π as input, and 2) it uses the pathwise Bellman operator (precisely defined in Section 3), which is defined to be a contraction w.r.t. the ℓ2 -norm at the states of the observed trajectory. [sent-42, score-0.697]
15 We first derive a bound on the performance of the pathwise LSTD solution for a setting that we call Markov design. [sent-46, score-0.313]
16 This bound is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. [sent-48, score-0.34]
17 Then, in the case that the Markov chain admits a stationary distribution ρπ and is β-mixing, we derive generalization bounds w.r.t. the stationary distribution ρπ . [sent-49, score-0.403]
18 In this very general setting, it is still possible to derive a bound for the performance of the LSTD solution, v̂, evaluated at the states of the trajectory used by the algorithm. [sent-58, score-0.196]
19 Moreover, an analysis of the bound reveals a critical dependency on the smallest strictly positive eigenvalue νn of the sample-based Gram matrix. [sent-59, score-0.221]
20 Then, in the case in which the Markov chain has a stationary distribution ρπ , it is possible to relate the value of νn to the smallest eigenvalue of the Gram matrix defined according to ρπ . [sent-60, score-0.463]
21 Furthermore, it is possible to generalize the previous performance bound over the entire state space under the measure ρπ , when the samples are drawn from a stationary β-mixing process (Theorem 5). [sent-61, score-0.335]
22 The extension in Theorem 6 to the case in which the samples belong to a trajectory generated by a fast mixing Markov chain shows that it is possible to achieve the same performance as in the case of stationary β-mixing processes. [sent-65, score-0.483]
23 Finally, the analysis of LSPI reveals the need for several critical assumptions on the stationary distributions of the policies that are greedy w.r.t. functions in the truncated space F̃ . [sent-66, score-0.329]
24 In Section 3, we introduce pathwise LSTD by a minor modification to the standard LSTD formulation in order to guarantee the existence of at least one solution. [sent-75, score-0.255]
25 In Section 5, we show how the Markov design bound of Section 4 may be extended when the Markov chain admits a stationary distribution. [sent-77, score-0.39]
26 A deterministic policy π : X → A is a mapping from states to actions. [sent-89, score-0.245]
27 For a given policy π, the MDP M is reduced to a Markov chain M π = ⟨X , Rπ , Pπ , γ⟩ with the reward function Rπ (x) = r(x, π(x)), transition kernel Pπ (·|x) = P(·|x, π(x)), and stationary distribution ρπ (if it admits one). [sent-90, score-0.598]
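To make this reduction concrete, here is a minimal tabular sketch; the array shapes and names are our own illustration, not the paper's notation or code:

```python
import numpy as np

def induced_chain(R, P, policy):
    """Reduce a finite MDP to the Markov chain M^pi of a deterministic policy.

    R      : (S, A) array, R[x, a] = r(x, a)
    P      : (S, A, S) array, P[x, a, y] = P(y | x, a)
    policy : (S,) integer array, policy[x] = pi(x)
    """
    states = np.arange(P.shape[0])
    R_pi = R[states, policy]        # R^pi(x)     = r(x, pi(x))
    P_pi = P[states, policy, :]     # P^pi(. | x) = P(. | x, pi(x))
    return R_pi, P_pi
```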
28 The value function of a policy π, V π , is the unique fixed-point of the Bellman operator T π : B (X ;Vmax = Rmax /(1 − γ)) → B (X ;Vmax ) defined by (T πV )(x) = Rπ (x) + γ ∫X Pπ (dy|x)V (y). [sent-91, score-0.259]
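For readability, the same definition in display form; this is only a cleaned-up restatement of the line above, with the fixed-point property made explicit:

```latex
(T^{\pi} V)(x) \;=\; R^{\pi}(x) + \gamma \int_{\mathcal{X}} P^{\pi}(dy \mid x)\, V(y),
\qquad V^{\pi} = T^{\pi} V^{\pi},
\qquad V_{\max} = \frac{R_{\max}}{1 - \gamma}.
```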
29 In the following sections, to simplify the notation, we remove the dependency on the policy π and use R, P, V , ρ, and T instead of Rπ , Pπ , V π , ρπ , and T π whenever the policy π is fixed and clear from the context. [sent-93, score-0.459]
30 , Xn generated by following the policy, and returns the fixed-point. Algorithm 1: A pseudo-code for the batch pathwise LSTD algorithm. [sent-122, score-0.244]
31 Input: Linear space F = span{ϕi , 1 ≤ i ≤ d}, sample trajectory {(xt , rt )}n t=1 of the Markov chain. Build the feature matrix Φ = [φ(x1 )⊤ ; . . . ; φ(xn )⊤ ]. [sent-123, score-0.227]
32 The motivation for using the pathwise Bellman operator is that it is a γ-contraction in ℓ2 -norm, that is, for any y, z ∈ Rn , we have ||T y − T z||2² = ||γP(y − z)||2² ≤ γ² ||y − z||2² . [sent-128, score-0.272]
33 We call the solution with minimal norm, α = A+ b, where A+ is the Moore-Penrose pseudo-inverse of A, the pathwise LSTD solution. [sent-137, score-0.245]
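The following is a minimal numerical sketch of the batch pathwise LSTD computation described above. The zero-padded shift that plays the role of the empirical next-state operator, and all variable names, are our own illustration rather than the paper's code:

```python
import numpy as np

def pathwise_lstd(phi, rewards, gamma):
    """Pathwise LSTD from a single trajectory (sketch).

    phi     : (n, d) feature matrix, row t = phi(x_t)
    rewards : (n,) rewards observed along the trajectory
    gamma   : discount factor in [0, 1)
    """
    n, d = phi.shape
    # Empirical pathwise operator: next-state features are the rows shifted
    # by one step, with the last row set to zero; this is what makes the
    # operator a gamma-contraction in the empirical l2-norm.
    phi_next = np.vstack([phi[1:], np.zeros((1, d))])
    A = phi.T @ (phi - gamma * phi_next)
    b = phi.T @ rewards
    alpha = np.linalg.pinv(A) @ b   # minimal-norm solution alpha = A^+ b
    return alpha                    # value estimate at the samples: phi @ alpha
```

Using the Moore-Penrose pseudo-inverse is what guarantees that a solution exists even when A is singular, which matches the stated motivation for the pathwise formulation.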
34 Markov Design Bound In Section 3, we defined the pathwise Bellman operator with a slight modification in the definition of the empirical Bellman operator T̂ , and showed that the operator Π̂T̂ always has a unique fixed point v̂. [sent-140, score-0.364]
35 In this section, we derive a bound for the performance of v̂ evaluated at the states of the trajectory used by the pathwise LSTD algorithm. [sent-141, score-0.422]
36 , Xn be a trajectory generated by the Markov chain, and v, v̂ ∈ Rn be the vectors whose components are the value function and the pathwise LSTD solution at {Xt }n t=1 , respectively. [sent-147, score-0.359]
37 Remark 4 Theorem 1 provides a bound without any reference to the stationary distribution of the Markov chain. [sent-178, score-0.295]
38 In fact, the bound of Equation 1 holds even when the chain does not admit a stationary distribution. [sent-179, score-0.372]
39 This Markov chain is not recurrent, and thus, does not have a stationary distribution. [sent-182, score-0.323]
40 Then according to Theorem 1, pathwise LSTD is able to estimate the value function at the samples at a rate O(1/√n). [sent-187, score-0.272]
41 This implies that LSTD does not require the Markov chain to possess a stationary distribution. [sent-191, score-0.323]
42 Remark 5 The most critical part of the bound in Equation 1 is the inverse dependency on the smallest positive eigenvalue νn . [sent-192, score-0.206]
43 Furthermore, if the Markov chain admits a stationary distribution ρ, we are able to relate the existence of the LSTD solution to the smallest eigenvalue of the Gram matrix defined according to ρ (see Section 5. [sent-195, score-0.529]
44 In the Markov design model considered in this lemma, states {Xt }n t=1 are random variables generated according to the Markov chain and the noise terms ξt may depend on the next state Xt+1 (but should be centered conditioned on the past states X1 , . . . , Xt ). [sent-256, score-0.18]
45 Generalization Bounds As we pointed out earlier, Theorem 1 makes no assumption on the existence of the stationary distribution of the Markov chain. [sent-272, score-0.291]
46 This generality comes at the cost that the performance is evaluated only at the states visited by the Markov chain and no generalization on other states is possible. [sent-273, score-0.162]
47 However, in many problems of interest, the Markov chain has a stationary distribution ρ, and thus the performance may be generalized to the whole state space under the measure ρ. [sent-274, score-0.344]
48 Moreover, if ρ exists, it is possible to derive a condition for the existence of the pathwise LSTD solution depending on the number of samples and the smallest eigenvalue of the Gram matrix defined according to ρ ; G ∈ Rd×d , Gi j = ∫X ϕi (x)ϕ j (x)ρ(dx). [sent-275, score-0.458]
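The quantity the bounds hinge on can be read off the sample-based Gram matrix directly; a small sketch (the tolerance and names are ours):

```python
import numpy as np

def smallest_positive_eigenvalue(phi, tol=1e-12):
    """Smallest strictly positive eigenvalue nu_n of the empirical Gram
    matrix (1/n) Phi^T Phi (sketch)."""
    n = phi.shape[0]
    gram = (phi.T @ phi) / n
    eigvals = np.linalg.eigvalsh(gram)   # Gram matrix is symmetric
    positive = eigvals[eigvals > tol]
    return positive.min() if positive.size else 0.0
```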
49 If ρ is the stationary distribution of the Markov chain, we define the orthogonal projection operator Π : B (X ;Vmax ) → F as ΠV = arg min f ∈F ||V − f ||ρ . [sent-279, score-0.33]
50 Finally, we should guarantee that the pathwise LSTD solution V is uniformly bounded on X . [sent-284, score-0.245]
51 In the next sections, we present conditions on the existence of the pathwise LSTD solution and derive generalization bounds under different assumptions on the way the samples X1 , . . . , Xn are generated. [sent-286, score-0.361]
52 More specifically, we show that if a large enough number of samples (depending on the smallest eigenvalue of G) is available, then the smallest eigenvalue of (1/n) Φ⊤ Φ is strictly positive with high probability. [sent-292, score-0.265]
53 , Xn be a trajectory of length n of a stationary β-mixing process with parameters β̄, b, κ and stationary distribution ρ. [sent-297, score-0.582]
54 Remark 1 In order to make the condition on the number of samples and its dependency on the critical parameters of the problem at hand more explicit, let us consider the case of a stationary process with b = β̄ = κ = 1. [sent-323, score-0.341]
55 As can be seen, the number of samples needed to have strictly positive eigenvalues in the sample-based Gram matrix has an inverse dependency on the smallest eigenvalue of G. [sent-325, score-0.25]
56 2 Generalization Bounds for Stationary β-mixing Processes In this section, we show how Theorem 1 may be generalized to the entire state space X when the Markov chain M π has a stationary distribution ρ. [sent-328, score-0.344]
57 , Xn are obtained by following a single trajectory in the stationary regime of M π , that is, when we consider that X1 is drawn from ρ. [sent-332, score-0.321]
58 , Xn be a path generated by a stationary β-mixing process with parameters β̄, b, κ and stationary distribution ρ. [sent-336, score-0.521]
59 They consider a general setting in which the function space F is bounded and the performance of the algorithm is evaluated according to an arbitrary measure µ (possibly different than the stationary distribution of the Markov chain ρ). [sent-357, score-0.344]
60 (2008), samples are drawn from a stationary β-mixing process; however, F is a linear space and ρ is the stationary distribution of the Markov chain. [sent-362, score-0.517]
61 , Xn is generated by a stationary β-mixing process with stationary distribution ρ. [sent-377, score-0.504]
62 This is possible if we consider samples of a Markov chain during its stationary regime, that is, X1 ∼ ρ. [sent-378, score-0.369]
63 , Xn is no longer a realization of a stationary β-mixing process. [sent-383, score-0.225]
64 Nonetheless, under suitable conditions, after ñ < n steps, the distribution of Xñ approaches the stationary distribution ρ. [sent-384, score-0.267]
65 We now derive a bound for a modification of pathwise LSTD in which the first ñ samples (that are used to burn the chain) are discarded and the remaining n − ñ samples are used as training samples for the algorithm. [sent-389, score-0.432]
66 , Xn be a trajectory generated by a β-mixing Markov chain with parameters β̄, b, κ and stationary distribution ρ. [sent-393, score-0.458]
67 Remark 1 The bound in Equation 19 indicates that in the case of β-mixing Markov chains, a similar performance to the one for stationary β-mixing processes is obtained by discarding the first ñ = O(log n) samples. [sent-401, score-0.274]
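A sketch of this modified procedure: discard a logarithmic burn-in prefix and run pathwise LSTD on the rest. The constant in front of log n is our placeholder; in the paper the burn-in length ñ depends on the mixing parameters. This reuses the pathwise_lstd sketch given earlier.

```python
import math

def burn_in_pathwise_lstd(phi, rewards, gamma, c=1.0):
    """Pathwise LSTD after discarding the first n_tilde = O(log n) samples
    used to burn in the chain (sketch)."""
    n = phi.shape[0]
    n_tilde = min(int(math.ceil(c * math.log(max(n, 2)))), n - 1)
    return pathwise_lstd(phi[n_tilde:], rewards[n_tilde:], gamma)
```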
68 Finite-Sample Analysis of LSPI In the previous sections we studied the performance of pathwise-LSTD for policy evaluation. [sent-403, score-0.213]
69 Now we move to the analysis of the least-squares policy iteration (LSPI) algorithm (Lagoudakis and Parr, 2003) in which at each iteration k samples are collected by following a single trajectory of the [sent-404, score-0.423]
70 policy under evaluation, πk , and LSTD is used to compute an approximation of V πk . [sent-406, score-0.229]
71 In particular, in the next section we report a performance bound by comparing the value of the policy returned by the algorithm after K iterations, V πK , and the optimal value function, V ∗ , w.r.t. a target distribution σ. [sent-407, score-0.262]
72 We first introduce the greedy policy operator G that maps value functions to their corresponding greedy policies: (G (V ))(x) = arg max a∈A [ r(x, a) + γ ∫X P(dy|x, a)V (y) ] . [sent-416, score-0.335]
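In the tabular representation used in the earlier sketches, the greedy operator is a one-line argmax; again this is our own illustration, not the paper's code:

```python
import numpy as np

def greedy_policy(V, R, P, gamma):
    """Greedy policy operator G(V) for a finite MDP (sketch).

    V : (S,) value function
    R : (S, A) rewards, P : (S, A, S) transition kernel
    """
    Q = R + gamma * P @ V        # Q[x, a] = r(x, a) + gamma * sum_y P(y|x,a) V(y)
    return Q.argmax(axis=1)      # (G(V))(x) = argmax_a Q[x, a]
```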
73 LSPI is a policy iteration algorithm that uses LSTD for policy evaluation at each iteration. [sent-421, score-0.452]
74 It starts with an arbitrary initial value function V−1 ∈ F and its corresponding greedy policy π0 . [sent-422, score-0.251]
75 At the first iteration, it approximates V π0 using LSTD and returns a function V0 whose truncated version Ṽ0 is used to build the policy π1 for the second iteration. [sent-423, score-0.248]
76 So, at each iteration k of LSPI, a function Vk−1 is computed as an approximation to V πk−1 , then truncated, Ṽk−1 , and used to build the policy πk = G (Ṽk−1 ). [sent-428, score-0.255]
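Putting the pieces together, a schematic LSPI loop under the same tabular assumptions; the trajectory-sampling routine is a placeholder for collecting a fresh trajectory under the current policy, and the whole function is a sketch rather than the authors' implementation:

```python
import numpy as np

def lspi(sample_trajectory, features, R, P, gamma, v_max, K):
    """Schematic LSPI (sketch): pathwise LSTD evaluation, truncation at
    +/- v_max, then greedy improvement using the MDP model."""
    n_states = P.shape[0]
    policy = np.zeros(n_states, dtype=int)            # arbitrary initial policy pi_0
    for _ in range(K):
        phi, rewards = sample_trajectory(policy)      # single trajectory under pi_k
        alpha = pathwise_lstd(phi, rewards, gamma)    # V_k approx. = Phi alpha
        V = np.clip(features @ alpha, -v_max, v_max)  # truncated value tilde V_k
        policy = greedy_policy(V, R, P, gamma)        # pi_{k+1} = G(tilde V_k)
    return policy
```

The greedy step uses the transition kernel P, which is consistent with the remark below that the MDP model is needed to generate the greedy policy.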
77 Note that the MDP model is needed in order to generate the greedy policy πk . [sent-429, score-0.251]
78 Moreover, the LSTD performance bounds of Section 5 require the samples to be collected by following the policy under evaluation. [sent-434, score-0.297]
79 Assumption 1 (Lower-bounding distribution) There exists a distribution µ ∈ S (X ) such that for any policy π that is greedy w.r.t. [sent-436, score-0.272]
80 a function in the truncated space F̃ , µ ≤ Cρπ , where C < ∞ is a constant and ρπ is the stationary distribution of policy π. [sent-439, score-0.494]
81 In fact, in the on-policy setting, if the policy under evaluation is deterministic, it does not provide any information about the value of actions a ≠ π(·), and the policy improvement step would always fail. [sent-453, score-0.426]
82 Thus, we need to consider stochastic policies where the current policy is perturbed by an ε > 0 randomization which guarantees that any action has a non-zero probability to be selected in any state. [sent-454, score-0.273]
83 Lemma 7 Under Assumption 3, at each iteration k of LSPI, the smallest eigenvalue ωk of the Gram matrix Gk defined according to the stationary distribution ρk = ρπk is strictly positive and ωk ≥ ωµ /C. [sent-466, score-0.406]
84 Finally, we make the following assumption on the stationary β-mixing processes corresponding to the stationary distributions of the policies encountered at the iterations of the LSPI algorithm. [sent-471, score-0.51]
85 Assumption 4 (Slower β-mixing process) We assume that there exists a stationary β-mixing process with parameters β̄, b, κ, such that for any policy π that is greedy w.r.t. [sent-472, score-0.476]
86 a function in the truncated space F̃ , it is slower than the stationary β-mixing process with stationary distribution ρπ (with parameters β̄π , bπ , κπ ). [sent-475, score-0.544]
87 Theorem 8 Let us assume that at each iteration k of the LSPI algorithm, a path of size n is generated from the stationary β-mixing process with stationary distribution ρk−1 = ρπk−1 . [sent-478, score-0.547]
88 , VK−1 ) be the sequence of value functions (truncated value functions) generated by LSPI after K iterations, and πK be the greedy policy w.r.t. the truncated approximation ṼK−1 . [sent-486, score-0.269]
89 Unlike policy evaluation, the approximation error E0 (F ) now depends on how well the space F can approximate the target functions V π obtained in the policy improvement step. [sent-501, score-0.442]
90 While the estimation errors are mostly similar to those in policy evaluation, an additional term of order γ^K is introduced. [sent-502, score-0.213]
91 (2010) recently performed a refined analysis of the propagation of the error in approximate policy iteration and provide interesting insights on the concentrability terms. [sent-505, score-0.295]
92 The analysis of LSTD explicitly requires that the samples are collected by following the policy under evaluation, πk , and the performance is bounded according to its stationary distribution ρk . [sent-507, score-0.521]
93 a target distribution σ, we need each of the policies encountered through the LSPI process to have a stationary distribution which does not differ too much from σ. [sent-511, score-0.326]
94 Furthermore, since the policies are random (at each iteration k the new policy πk is greedy w.r.t. [sent-512, score-0.321]
95 the approximation Ṽk−1 , which is random because of the sampled trajectory), we need to consider the distance between σ and the stationary distribution of any possible policy generated as greedy w.r.t. a function in the truncated space F̃ . [sent-515, score-0.531]
96 Thus in Assumption 1 we first assume the existence of a distribution µ lower-bounding any possible stationary distribution ρk . [sent-519, score-0.296]
97 In this case, we show that the LSPI performance, when the samples at each iteration are generated according to the stationary distribution of the policy under evaluation, can be arbitrarily bad. [sent-523, score-0.549]
98 Each term ||α∗k || can be bounded whenever the features of the space F are linearly independent according to the stationary distribution ρk . [sent-527, score-0.246]
99 Ṽk−1 , that is, πk = G (Ṽk−1 ) and ρπk be the stationary distribution of the Markov chain induced by πk . [sent-536, score-0.344]
100 For any distribution σ ∈ S (X ), we may write ||Vk − T πk Vk ||σ = ||(I − γPπk )(Vk −V πk )||σ ≤ ||I − γPπk ||σ ||Vk −V πk ||σ ≤ (1 + γ||Pπk ||σ ) ||Vk −V πk ||σ . If σ is the stationary distribution of πk , that is, σ = ρπk , then ||Pπk ||σ = 1 and the claim follows. [sent-539, score-0.267]
wordName wordTfidf (topN-words)
[('lstd', 0.721), ('lspi', 0.286), ('pathwise', 0.226), ('stationary', 0.225), ('policy', 0.213), ('vk', 0.203), ('vmax', 0.122), ('antos', 0.117), ('gram', 0.112), ('xt', 0.109), ('markov', 0.098), ('chain', 0.098), ('trajectory', 0.096), ('azaric', 0.091), ('havamzadeh', 0.091), ('quares', 0.081), ('unos', 0.078), ('bellman', 0.075), ('olicy', 0.069), ('teration', 0.069), ('inite', 0.063), ('eigenvalue', 0.057), ('ample', 0.054), ('mbr', 0.051), ('bound', 0.049), ('mdp', 0.048), ('operator', 0.046), ('samples', 0.046), ('smallest', 0.045), ('xn', 0.045), ('policies', 0.044), ('fn', 0.041), ('concentrability', 0.041), ('nalysis', 0.04), ('greedy', 0.038), ('truncated', 0.035), ('remark', 0.035), ('propagated', 0.035), ('lazaric', 0.035), ('dependency', 0.033), ('states', 0.032), ('parr', 0.031), ('inria', 0.031), ('nitesample', 0.03), ('tsitsiklis', 0.03), ('existence', 0.029), ('equation', 0.028), ('tv', 0.027), ('dy', 0.027), ('rmax', 0.027), ('theorem', 0.026), ('lagoudakis', 0.026), ('mohammad', 0.026), ('iteration', 0.026), ('roy', 0.025), ('ghavamzadeh', 0.023), ('orthogonal', 0.023), ('slower', 0.023), ('reward', 0.023), ('bounds', 0.022), ('critical', 0.022), ('eigenvalues', 0.022), ('cl', 0.022), ('alessandro', 0.022), ('rn', 0.021), ('bertsekas', 0.021), ('distribution', 0.021), ('deduce', 0.02), ('lemma', 0.02), ('yt', 0.02), ('discount', 0.019), ('derive', 0.019), ('nonetheless', 0.019), ('solution', 0.019), ('invertible', 0.018), ('generated', 0.018), ('admits', 0.018), ('farahmand', 0.017), ('rl', 0.017), ('path', 0.017), ('matrix', 0.017), ('truncation', 0.016), ('chains', 0.016), ('assumption', 0.016), ('collected', 0.016), ('probability', 0.016), ('martingale', 0.016), ('approximation', 0.016), ('azuma', 0.016), ('ltration', 0.016), ('rt', 0.016), ('strictly', 0.015), ('rd', 0.015), ('fr', 0.015), ('insights', 0.015), ('process', 0.015), ('projection', 0.015), ('log', 0.015), ('uniqueness', 0.015), ('munos', 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 46 jmlr-2012-Finite-Sample Analysis of Least-Squares Policy Iteration
Author: Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos
Abstract: In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm. Keywords: Markov decision processes, reinforcement learning, least-squares temporal-difference, least-squares policy iteration, generalization bounds, finite-sample analysis
2 0.14322579 34 jmlr-2012-Dynamic Policy Programming
Author: Mohammad Gheshlaghi Azar, Vicenç Gómez, Hilbert J. Kappen
Abstract: In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in the infinite-horizon Markov decision processes. DPP is an incremental algorithm that forces a gradual change in policy update. This allows us to prove finite-iteration and asymptotic ℓ∞ -norm performance-loss bounds in the presence of approximation/estimation error which depend on the average accumulated error as opposed to the standard bounds which are expressed in terms of the supremum of the errors. The dependency on the average error is important in problems with limited number of samples per iteration, for which the average of the errors can be significantly smaller in size than the supremum of the errors. Based on these theoretical results, we prove that a sampling-based variant of DPP (DPP-RL) asymptotically converges to the optimal policy. Finally, we illustrate numerically the applicability of these results on some benchmark problems and compare the performance of the approximate variants of DPP with some existing reinforcement learning (RL) methods. Keywords: approximate dynamic programming, reinforcement learning, Markov decision processes, Monte-Carlo methods, function approximation
3 0.10378312 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning
Author: Aviv Tamar, Dotan Di Castro, Ron Meir
Abstract: In reinforcement learning an agent uses online feedback from the environment in order to adaptively select an effective policy. Model free approaches address this task by directly mapping environmental states to actions, while model based methods attempt to construct a model of the environment, followed by a selection of optimal actions based on that model. Given the complementary advantages of both approaches, we suggest a novel procedure which augments a model free algorithm with a partial model. The resulting hybrid algorithm switches between a model based and a model free mode, depending on the current state and the agent’s knowledge. Our method relies on a novel definition for a partially known model, and an estimator that incorporates such knowledge in order to reduce uncertainty in stochastic approximation iterations. We prove that such an approach leads to improved policy evaluation whenever environmental knowledge is available, without compromising performance when such knowledge is absent. Numerical simulations demonstrate the effectiveness of the approach on policy gradient and Q-learning algorithms, and its usefulness in solving a call admission control problem. Keywords: reinforcement learning, temporal difference, stochastic approximation, markov decision processes, hybrid model based model free algorithms
4 0.063392609 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
Author: Mehrdad Mahdavi, Rong Jin, Tianbao Yang
Abstract: In this paper we propose efficient algorithms for solving constrained online convex optimization problems. Our motivation stems from the observation that most algorithms proposed for online convex optimization require a projection onto the convex set K from which the decisions are made. While the projection is straightforward for simple shapes (e.g., Euclidean ball), for arbitrary complex sets it is the main computational challenge and may be inefficient in practice. In this paper, we consider an alternative online convex optimization problem. Instead of requiring that decisions belong to K for all rounds, we only require that the constraints, which define the set K , be satisfied in the long run. By turning the problem into an online convex-concave optimization problem, we propose an efficient algorithm which achieves O(√T ) regret bound and O(T 3/4 ) bound on the violation of constraints. Then, we modify the algorithm in order to guarantee that the constraints are satisfied in the long run. This gain is achieved at the price of getting O(T 3/4 ) regret bound. Our second algorithm is based on the mirror prox method (Nemirovski, 2005) to solve variational inequalities which achieves O(T 2/3 ) bound for both regret and the violation of constraints when the domain K can be described by a finite number of linear constraints. Finally, we extend the results to the setting where we only have partial access to the convex set K and propose a multipoint bandit feedback algorithm with the same bounds in expectation as our first algorithm. Keywords: online convex optimization, convex-concave optimization, bandit feedback, variational inequality
5 0.055982113 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
Author: Benedict C. May, Nathan Korda, Anthony Lee, David S. Leslie
Abstract: In sequential decision problems in an unknown environment, the decision maker often faces a dilemma over whether to explore to discover more about the environment, or to exploit current knowledge. We address the exploration-exploitation dilemma in a general setting encompassing both standard and contextualised bandit problems. The contextual bandit problem has recently resurfaced in attempts to maximise click-through rates in web based applications, a task with significant commercial interest. In this article we consider an approach of Thompson (1933) which makes use of samples from the posterior distributions for the instantaneous value of each action. We extend the approach by introducing a new algorithm, Optimistic Bayesian Sampling (OBS), in which the probability of playing an action increases with the uncertainty in the estimate of the action value. This results in better directed exploratory behaviour. We prove that, under unrestrictive assumptions, both approaches result in optimal behaviour with respect to the average reward criterion of Yang and Zhu (2002). We implement OBS and measure its performance in simulated Bernoulli bandit and linear regression domains, and also when tested with the task of personalised news article recommendation on a Yahoo! Front Page Today Module data set. We find that OBS performs competitively when compared to recently proposed benchmark algorithms and outperforms Thompson’s method throughout. Keywords: multi-armed bandits, contextual bandits, exploration-exploitation, sequential allocation, Thompson sampling
6 0.051531199 59 jmlr-2012-Linear Regression With Random Projections
7 0.050260246 58 jmlr-2012-Linear Fitted-Q Iteration with Multiple Reward Functions
8 0.043462347 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
9 0.041060865 84 jmlr-2012-Online Submodular Minimization
10 0.04072113 97 jmlr-2012-Regularization Techniques for Learning with Matrices
11 0.038318288 116 jmlr-2012-Transfer in Reinforcement Learning via Shared Features
12 0.036932148 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
13 0.03199799 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
14 0.030797178 25 jmlr-2012-Characterization and Greedy Learning of Interventional Markov Equivalence Classes of Directed Acyclic Graphs
15 0.028713077 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
16 0.026527578 80 jmlr-2012-On Ranking and Generalization Bounds
17 0.025416266 15 jmlr-2012-Algebraic Geometric Comparison of Probability Distributions
18 0.02537936 41 jmlr-2012-Exploration in Relational Domains for Model-based Reinforcement Learning
19 0.024834663 52 jmlr-2012-Iterative Reweighted Algorithms for Matrix Rank Minimization
20 0.024642929 26 jmlr-2012-Coherence Functions with Applications in Large-Margin Classification Methods
topicId topicWeight
[(0, -0.138), (1, -0.069), (2, -0.031), (3, -0.147), (4, 0.036), (5, -0.137), (6, -0.197), (7, -0.104), (8, -0.025), (9, 0.067), (10, -0.15), (11, 0.086), (12, 0.028), (13, -0.024), (14, -0.002), (15, -0.04), (16, 0.028), (17, -0.072), (18, -0.053), (19, 0.138), (20, -0.069), (21, -0.052), (22, 0.056), (23, 0.025), (24, -0.163), (25, -0.074), (26, 0.108), (27, -0.205), (28, 0.071), (29, 0.056), (30, 0.046), (31, -0.152), (32, -0.035), (33, -0.107), (34, 0.05), (35, -0.09), (36, -0.115), (37, 0.13), (38, -0.088), (39, 0.121), (40, -0.053), (41, -0.081), (42, -0.035), (43, -0.228), (44, 0.121), (45, -0.093), (46, 0.169), (47, 0.026), (48, -0.139), (49, -0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.91972417 46 jmlr-2012-Finite-Sample Analysis of Least-Squares Policy Iteration
Author: Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos
Abstract: In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm. Keywords: Markov decision processes, reinforcement learning, least-squares temporal-difference, least-squares policy iteration, generalization bounds, finite-sample analysis
2 0.75265229 34 jmlr-2012-Dynamic Policy Programming
Author: Mohammad Gheshlaghi Azar, Vicenç Gómez, Hilbert J. Kappen
Abstract: In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in the infinite-horizon Markov decision processes. DPP is an incremental algorithm that forces a gradual change in policy update. This allows us to prove finite-iteration and asymptotic ℓ∞ -norm performance-loss bounds in the presence of approximation/estimation error which depend on the average accumulated error as opposed to the standard bounds which are expressed in terms of the supremum of the errors. The dependency on the average error is important in problems with limited number of samples per iteration, for which the average of the errors can be significantly smaller in size than the supremum of the errors. Based on these theoretical results, we prove that a sampling-based variant of DPP (DPP-RL) asymptotically converges to the optimal policy. Finally, we illustrate numerically the applicability of these results on some benchmark problems and compare the performance of the approximate variants of DPP with some existing reinforcement learning (RL) methods. Keywords: approximate dynamic programming, reinforcement learning, Markov decision processes, Monte-Carlo methods, function approximation
3 0.53242975 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning
Author: Aviv Tamar, Dotan Di Castro, Ron Meir
Abstract: In reinforcement learning an agent uses online feedback from the environment in order to adaptively select an effective policy. Model free approaches address this task by directly mapping environmental states to actions, while model based methods attempt to construct a model of the environment, followed by a selection of optimal actions based on that model. Given the complementary advantages of both approaches, we suggest a novel procedure which augments a model free algorithm with a partial model. The resulting hybrid algorithm switches between a model based and a model free mode, depending on the current state and the agent’s knowledge. Our method relies on a novel definition for a partially known model, and an estimator that incorporates such knowledge in order to reduce uncertainty in stochastic approximation iterations. We prove that such an approach leads to improved policy evaluation whenever environmental knowledge is available, without compromising performance when such knowledge is absent. Numerical simulations demonstrate the effectiveness of the approach on policy gradient and Q-learning algorithms, and its usefulness in solving a call admission control problem. Keywords: reinforcement learning, temporal difference, stochastic approximation, markov decision processes, hybrid model based model free algorithms
4 0.28996298 86 jmlr-2012-Optimistic Bayesian Sampling in Contextual-Bandit Problems
Author: Benedict C. May, Nathan Korda, Anthony Lee, David S. Leslie
Abstract: In sequential decision problems in an unknown environment, the decision maker often faces a dilemma over whether to explore to discover more about the environment, or to exploit current knowledge. We address the exploration-exploitation dilemma in a general setting encompassing both standard and contextualised bandit problems. The contextual bandit problem has recently resurfaced in attempts to maximise click-through rates in web based applications, a task with significant commercial interest. In this article we consider an approach of Thompson (1933) which makes use of samples from the posterior distributions for the instantaneous value of each action. We extend the approach by introducing a new algorithm, Optimistic Bayesian Sampling (OBS), in which the probability of playing an action increases with the uncertainty in the estimate of the action value. This results in better directed exploratory behaviour. We prove that, under unrestrictive assumptions, both approaches result in optimal behaviour with respect to the average reward criterion of Yang and Zhu (2002). We implement OBS and measure its performance in simulated Bernoulli bandit and linear regression domains, and also when tested with the task of personalised news article recommendation on a Yahoo! Front Page Today Module data set. We find that OBS performs competitively when compared to recently proposed benchmark algorithms and outperforms Thompson’s method throughout. Keywords: multi-armed bandits, contextual bandits, exploration-exploitation, sequential allocation, Thompson sampling
5 0.25027466 59 jmlr-2012-Linear Regression With Random Projections
Author: Odalric-Ambrym Maillard, Rémi Munos
Abstract: We investigate a method for regression that makes use of a randomly generated subspace GP ⊂ F (of finite dimension P) of a given large (possibly infinite) dimensional function space F , for example, L2 ([0, 1]d ; R). GP is defined as the span of P random features that are linear combinations of a basis functions of F weighted by random Gaussian i.i.d. coefficients. We show practical motivation for the use of this approach, detail the link that this random projections method share with RKHS and Gaussian objects theory and prove, both in deterministic and random design, approximation error bounds when searching for the best regression function in GP rather than in F , and derive excess risk bounds for a specific regression algorithm (least squares regression in GP ). This paper stresses the motivation to study such methods, thus the analysis developed is kept simple for explanations purpose and leaves room for future developments. Keywords: regression, random matrices, dimension reduction
7 0.21435893 105 jmlr-2012-Selective Sampling and Active Learning from Single and Multiple Teachers
8 0.21345352 115 jmlr-2012-Trading Regret for Efficiency: Online Convex Optimization with Long Term Constraints
9 0.198696 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
10 0.19049275 76 jmlr-2012-Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics
11 0.17445612 118 jmlr-2012-Variational Multinomial Logit Gaussian Process
12 0.16044757 4 jmlr-2012-A Kernel Two-Sample Test
13 0.1557686 111 jmlr-2012-Structured Sparsity and Generalization
14 0.14521337 84 jmlr-2012-Online Submodular Minimization
15 0.13991575 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
16 0.13771509 97 jmlr-2012-Regularization Techniques for Learning with Matrices
17 0.13742009 48 jmlr-2012-High-Dimensional Gaussian Graphical Model Selection: Walk Summability and Local Separation Criterion
18 0.1320129 66 jmlr-2012-Metric and Kernel Learning Using a Linear Transformation
19 0.12906039 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
20 0.12773262 2 jmlr-2012-A Comparison of the Lasso and Marginal Regression
topicId topicWeight
[(16, 0.347), (21, 0.043), (26, 0.032), (29, 0.037), (56, 0.011), (57, 0.015), (64, 0.016), (69, 0.012), (75, 0.05), (79, 0.011), (92, 0.235), (96, 0.065)]
simIndex simValue paperId paperTitle
same-paper 1 0.75042468 46 jmlr-2012-Finite-Sample Analysis of Least-Squares Policy Iteration
Author: Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos
Abstract: In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and report finite-sample analysis for this algorithm. To do so, we first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case when the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy evaluation step is propagated through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm. Keywords: Markov decision processes, reinforcement learning, least-squares temporal-difference, least-squares policy iteration, generalization bounds, finite-sample analysis
2 0.69135737 11 jmlr-2012-A Unifying Probabilistic Perspective for Spectral Dimensionality Reduction: Insights and New Models
Author: Neil D. Lawrence
Abstract: We introduce a new perspective on spectral dimensionality reduction which views these methods as Gaussian Markov random fields (GRFs). Our unifying perspective is based on the maximum entropy principle which is in turn inspired by maximum variance unfolding. The resulting model, which we call maximum entropy unfolding (MEU) is a nonlinear generalization of principal component analysis. We relate the model to Laplacian eigenmaps and isomap. We show that parameter fitting in the locally linear embedding (LLE) is approximate maximum likelihood MEU. We introduce a variant of LLE that performs maximum likelihood exactly: Acyclic LLE (ALLE). We show that MEU and ALLE are competitive with the leading spectral approaches on a robot navigation visualization and a human motion capture data set. Finally the maximum likelihood perspective allows us to introduce a new approach to dimensionality reduction based on L1 regularization of the Gaussian random field via the graphical lasso.
3 0.58847827 59 jmlr-2012-Linear Regression With Random Projections
Author: Odalric-Ambrym Maillard, Rémi Munos
Abstract: We investigate a method for regression that makes use of a randomly generated subspace GP ⊂ F (of finite dimension P) of a given large (possibly infinite) dimensional function space F , for example, L2 ([0, 1]d ; R). GP is defined as the span of P random features that are linear combinations of a basis functions of F weighted by random Gaussian i.i.d. coefficients. We show practical motivation for the use of this approach, detail the link that this random projections method share with RKHS and Gaussian objects theory and prove, both in deterministic and random design, approximation error bounds when searching for the best regression function in GP rather than in F , and derive excess risk bounds for a specific regression algorithm (least squares regression in GP ). This paper stresses the motivation to study such methods, thus the analysis developed is kept simple for explanations purpose and leaves room for future developments. Keywords: regression, random matrices, dimension reduction
Author: Alain Hauser, Peter Bühlmann
Abstract: The investigation of directed acyclic graphs (DAGs) encoding the same Markov property, that is the same conditional independence relations of multivariate observational distributions, has a long tradition; many algorithms exist for model selection and structure learning in Markov equivalence classes. In this paper, we extend the notion of Markov equivalence of DAGs to the case of interventional distributions arising from multiple intervention experiments. We show that under reasonable assumptions on the intervention experiments, interventional Markov equivalence defines a finer partitioning of DAGs than observational Markov equivalence and hence improves the identifiability of causal models. We give a graph theoretic criterion for two DAGs being Markov equivalent under interventions and show that each interventional Markov equivalence class can, analogously to the observational case, be uniquely represented by a chain graph called interventional essential graph (also known as CPDAG in the observational case). These are key insights for deriving a generalization of the Greedy Equivalence Search algorithm aimed at structure learning from interventional data. This new algorithm is evaluated in a simulation study. Keywords: causal inference, interventions, graphical model, Markov equivalence, greedy equivalence search
5 0.58552676 99 jmlr-2012-Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise
Author: Sahand Negahban, Martin J. Wainwright
Abstract: We consider the matrix completion problem under a form of row/column weighted entrywise sampling, including the case of uniform entrywise sampling as a special case. We analyze the associated random observation operator, and prove that with high probability, it satisfies a form of restricted strong convexity with respect to weighted Frobenius norm. Using this property, we obtain as corollaries a number of error bounds on matrix completion in the weighted Frobenius norm under noisy sampling and for both exact and near low-rank matrices. Our results are based on measures of the “spikiness” and “low-rankness” of matrices that are less restrictive than the incoherence conditions imposed in previous work. Our technique involves an M-estimator that includes controls on both the rank and spikiness of the solution, and we establish non-asymptotic error bounds in weighted Frobenius norm for recovering matrices lying with ℓq -“balls” of bounded spikiness. Using information-theoretic methods, we show that no algorithm can achieve better estimates (up to a logarithmic factor) over these same sets, showing that our conditions on matrices and associated rates are essentially optimal. Keywords: matrix completion, collaborative filtering, convex optimization
6 0.58525711 51 jmlr-2012-Integrating a Partial Model into Model Free Reinforcement Learning
7 0.56227297 67 jmlr-2012-Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming
8 0.55432272 34 jmlr-2012-Dynamic Policy Programming
9 0.53844428 111 jmlr-2012-Structured Sparsity and Generalization
10 0.52844739 68 jmlr-2012-Minimax Manifold Estimation
11 0.52800274 82 jmlr-2012-On the Necessity of Irrelevant Variables
12 0.52664703 29 jmlr-2012-Consistent Model Selection Criteria on High Dimensions
13 0.51606798 7 jmlr-2012-A Multi-Stage Framework for Dantzig Selector and LASSO
14 0.51566964 73 jmlr-2012-Multi-task Regression using Minimal Penalties
15 0.51345181 96 jmlr-2012-Refinement of Operator-valued Reproducing Kernels
16 0.51142907 8 jmlr-2012-A Primal-Dual Convergence Analysis of Boosting
17 0.50970644 13 jmlr-2012-Active Learning via Perfect Selective Classification
18 0.5086143 80 jmlr-2012-On Ranking and Generalization Bounds
19 0.50830203 15 jmlr-2012-Algebraic Geometric Comparison of Probability Distributions
20 0.50734055 64 jmlr-2012-Manifold Identification in Dual Averaging for Regularized Stochastic Online Learning