jmlr jmlr2008 jmlr2008-95 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Balázs Csanád Csáji, László Monostori
Abstract: The paper investigates the possibility of applying value function based reinforcement learning (RL) methods in cases when the environment may change over time. First, theorems are presented which show that the optimal value function of a discounted Markov decision process (MDP) Lipschitz continuously depends on the immediate-cost function and the transition-probability function. Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of (ε, δ)-MDPs is introduced, which is a generalization of MDPs and ε-MDPs. In this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must be bounded in the limit. Then, learning algorithms in changing environments are analyzed. A general relaxed convergence theorem for stochastic iterative algorithms is presented. We also demonstrate the results through three classical RL methods: asynchronous value iteration, Q-learning and temporal difference learning. Finally, some numerical experiments concerning changing environments are presented. Keywords: Markov decision processes, reinforcement learning, changing environments, (ε, δ)-MDPs, value function bounds, stochastic iterative algorithms
Reference: text
sentIndex sentText sentNum sentScore
1 First, theorems are presented which show that the optimal value function of a discounted Markov decision process (MDP) Lipschitz continuously depends on the immediate-cost function and the transition-probability function. [sent-6, score-0.201]
2 Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. [sent-7, score-0.187]
3 In this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must be bounded in the limit. [sent-9, score-0.299]
4 A general relaxed convergence theorem for stochastic iterative algorithms is presented. [sent-11, score-0.26]
5 Keywords: Markov decision processes, reinforcement learning, changing environments, (ε, δ)MDPs, value function bounds, stochastic iterative algorithms 1. [sent-14, score-0.268]
6 Several solution methods are known, for example, from the field of [neuro-]dynamic programming (NDP) or reinforcement learning (RL), which compute or approximate the optimal control policy of an MDP. [sent-18, score-0.408]
7 The dynamics of (Markovian) control problems can often be formulated as follows: $x_{t+1} = f(x_t, a_t, w_t)$, (1) where $x_t$ is the state of the system at time $t \in \mathbb{N}$, $a_t$ is a control action and $w_t$ is some disturbance. [sent-26, score-0.472]
8 There is also a cost function $g(x_t, a_t)$ and the aim is to find an optimal control policy that minimizes the [discounted] costs over time (the next section will contain the basic definitions). [sent-27, score-0.372]
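As an illustration of the dynamics in (1) and the discounted cost objective, here is a minimal rollout sketch; the transition function, cost function, policy and parameters are hypothetical placeholders, not taken from the paper.

```python
import random

def simulate_discounted_cost(f, g, policy, x0, alpha=0.95, horizon=200, seed=0):
    """Roll out x_{t+1} = f(x_t, a_t, w_t) under a policy and sum discounted costs g(x_t, a_t)."""
    rng = random.Random(seed)
    x, total, discount = x0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(x)                # control action chosen in state x
        w = rng.gauss(0.0, 1.0)      # disturbance w_t
        total += discount * g(x, a)  # accumulate discounted one-step cost
        x = f(x, a, w)               # state transition
        discount *= alpha
    return total

# Hypothetical one-dimensional system: drift toward zero plus noise.
f = lambda x, a, w: 0.9 * x + a + 0.1 * w
g = lambda x, a: x * x + 0.01 * a * a     # quadratic state and action cost
policy = lambda x: -0.5 * x               # a simple proportional controller
print(simulate_discounted_cost(f, g, policy, x0=1.0))
```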
9 In many applications the calculation of a control policy should be fast and, additionally, environmental changes should also be taken into account. [sent-28, score-0.431]
10 In most control applications, the system uses a model of the environment during the computation of a control policy. [sent-30, score-0.356]
11 The changing of the dynamics can also be modeled as an MDP; however, including environmental changes as a higher-level MDP very likely leads to problems which do not have any practically efficient solution methods. [sent-34, score-0.19]
12 The paper argues that if the model was “close” to the environment, then a “good” policy based on the model cannot be arbitrarily “wrong” from the viewpoint of the environment and, moreover, “slight” changes in the environment result only in “slight” changes in the optimal cost-to-go function. [sent-35, score-0.534]
13 Then, we apply that framework to deduce an approximate convergence theorem for time-dependent stochastic iterative algorithms. [sent-40, score-0.208]
14 Later, with the help of this general theorem, we show relaxed convergence properties (more precisely, κ-approximation) for value function based reinforcement learning methods working in (ε, δ)-MDPs. [sent-41, score-0.217]
15 We also present value function bounds (Theorem 13) for the case of changes in the discount factor and demonstrate that this dependence is not Lipschitz continuous. [sent-46, score-0.264]
16 In this model the transition function and the cost function may change over time, provided that the accumulated changes remain bounded in the limit. [sent-51, score-0.215]
17 We show (Lemma 18) that potential changes in the discount factor can be incorporated into the immediate-cost function, thus, we do not have to consider discount factor changes. [sent-52, score-0.451]
18 As a corollary, we get an approximation theorem for value function based reinforcement learning methods in (ε, δ)-MDPs (Corollary 21). [sent-56, score-0.19]
19 Regarding learning, we illustrate the effects of environmental changes through two problems: scheduling and grid world. [sent-64, score-0.261]
20 The agent receives information about the state of the environment, x; at each state x the agent is allowed to choose an action a ∈ A(x). [sent-75, score-0.393]
21 After the action is selected, the environment moves to the next state according to the probability distribution p(x, a) and the decision-maker collects its one-step cost, g(x, a). [sent-76, score-0.207]
22 Definition 2 A (stationary, Markovian) control policy determines the action to take in each state. [sent-78, score-0.353]
23 Definition 3 The value or cost-to-go function of a policy π is a function from states to costs, $J^\pi : X \to \mathbb{R}$. [sent-84, score-0.212]
24 A control policy is (uniformly) optimal if its value function is less than or equal to the value functions of all other control policies. [sent-89, score-0.356]
25 There always exists at least one optimal policy (Sutton and Barto, 1998). [sent-90, score-0.212]
26 It is known that the Bellman operator is a supremum norm contraction with Lipschitz constant α. [sent-100, score-0.279]
27 If we consider stochastic shortest path (SSP) problems, which arise when the MDP has an absorbing terminal (goal) state, then the Bellman operator becomes a pseudo-contraction in the weighted supremum norm (Bertsekas and Tsitsiklis, 1996). [sent-101, score-0.252]
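The contraction property is what makes plain value iteration work; the sketch below iterates the Bellman operator of a small discounted MDP until the supremum-norm change is negligible. The transition probabilities and costs are made-up numbers, used only for illustration.

```python
import numpy as np

def bellman_operator(J, P, g, alpha):
    """(TJ)(x) = min_a [ g(x,a) + alpha * sum_y p(y|x,a) J(y) ]."""
    # P has shape (n_states, n_actions, n_states); g has shape (n_states, n_actions).
    return np.min(g + alpha * P @ J, axis=1)

def value_iteration(P, g, alpha=0.9, tol=1e-8):
    J = np.zeros(P.shape[0])
    while True:
        J_next = bellman_operator(J, P, g, alpha)
        if np.max(np.abs(J_next - J)) < tol:   # sup-norm stopping rule
            return J_next
        J = J_next

# Made-up 2-state, 2-action MDP.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
g = np.array([[1.0, 2.0],
              [0.5, 0.3]])
print(value_iteration(P, g))
```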
28 From a given value function J, it is straightforward to get a policy, for example, by applying a greedy and deterministic policy (w.r.t. J). [sent-102, score-0.212]
29 If J is “close” to $J^*$, then the greedy policy with one-stage lookahead based on J will also be “close” to an optimal policy, as was proven by Bertsekas and Tsitsiklis (1996): Theorem 6 Let M be a discounted MDP and let J be an arbitrary value function. [sent-109, score-0.371]
30 The value function of the greedy policy based on J is denoted by J π . [sent-110, score-0.249]
31 Consequently, if we could obtain a good approximation of the optimal value function, then we would immediately have a good control policy as well, for example, the greedy policy with respect to our approximate value function. [sent-113, score-0.284]
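A minimal sketch of extracting the deterministic one-stage-lookahead greedy policy from an approximate value function J, in the spirit of Theorem 6; the MDP arrays are the same made-up ones as in the value-iteration sketch above.

```python
import numpy as np

def greedy_policy(J, P, g, alpha):
    """pi(x) = argmin_a [ g(x,a) + alpha * sum_y p(y|x,a) J(y) ]  (one-stage lookahead)."""
    return np.argmin(g + alpha * P @ J, axis=1)

# Reusing the made-up MDP from the previous sketch:
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
g = np.array([[1.0, 2.0],
              [0.5, 0.3]])
J_approx = np.array([3.0, 2.0])   # any approximate value function
print(greedy_policy(J_approx, P, g, alpha=0.9))
```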
32 $\limsup_{t \to \infty} \|X_t - X\| \le \kappa$ almost surely. (3) Hence, the “meaning” of this definition is that the sequence $X_t$ converges almost surely to an environment of X and the radius of this environment is less than or equal to a given constant κ. [sent-125, score-0.199]
33 Consider, for example, the SARSA (state-action-reward-state-action) algorithm, which is a model-free policy evaluation method. [sent-130, score-0.212]
34 Definition 8 We say that the operator sequence $H_t$ κ-approximates operator H : V → V at V ∈ V if for any initial $V_0$ ∈ V the sequence $V_{t+1} = H_t(V_t, V)$ κ-approximates HV. [sent-134, score-0.228]
35 Theorem 10 Assume that two discounted MDPs differ only in their transition functions, denoted by $p_1$ and $p_2$. [sent-170, score-0.335]
36 Let the corresponding optimal value functions be $J_1^*$ and $J_2^*$; then $\|J_1^* - J_2^*\|_\infty \le \dfrac{n \alpha \, \|g\|_\infty \, \|p_1 - p_2\|_\infty}{(1 - \alpha)^2}$; recall that n is the size of the state space and $\alpha \in [0, 1)$ is the discount rate. [sent-171, score-0.277]
37 Theorem 12 Assume that two discounted MDPs differ only in the immediate-cost functions, $g_1$ and $g_2$. [sent-192, score-0.21]
38 3 Changes in the Discount Factor. The following theorem shows that the change of the value function can also be estimated when the discount rate changes (all proofs can be found in the appendix). [sent-195, score-0.33]
39 Theorem 13 Assume that two discounted MDPs differ only in the discount factors, denoted by $\alpha_1, \alpha_2 \in [0, 1)$. [sent-196, score-0.434]
40 Suppose that the MDP has discount factor $\alpha_1 = 0$; thus, $J_1^*(x) = 1$. [sent-201, score-0.187]
41 Now, if we change the discount rate to $\alpha_2 \in (0, 1)$, then $|\alpha_1 - \alpha_2| < 1$ but $\|J_1^* - J_2^*\|_\infty$ could be arbitrarily large, since $J_2^*(x) \to \infty$ as $\alpha_2 \to 1$. [sent-202, score-0.187]
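For concreteness, the counterexample can be written out as a short calculation; the unit immediate cost $g \equiv 1$ below is an assumption made only to match $J_1^*(x) = 1$ at $\alpha_1 = 0$.

```latex
% Assume g(x,a) \equiv 1. With discount factor \alpha_1 = 0 the optimal cost-to-go is
J_1^*(x) = 1,
% while changing only the discount factor to \alpha_2 \in (0,1) yields the geometric series
J_2^*(x) = \sum_{t=0}^{\infty} \alpha_2^{\,t} \cdot 1 = \frac{1}{1-\alpha_2}
  \;\longrightarrow\; \infty \quad \text{as } \alpha_2 \to 1,
% so \|J_1^* - J_2^*\|_\infty cannot be bounded in terms of |\alpha_1 - \alpha_2| alone.
```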
42 Lemma 14 Assume that we have two discounted MDPs which differ only in the transition-probability functions or only in the immediate-cost functions or only in the discount factors. [sent-211, score-0.469]
43 5 Further Remarks on Inaccurate Models In this section we saw that the optimal value function of a discounted MDP depends smoothly on the transition function, the cost function and the discount rate. [sent-215, score-0.484]
44 Later, we will see that changes in the discount rate can be traced back to changes in the cost function (Lemma 18), therefore, it is sufficient to consider transition and cost changes. [sent-219, score-0.559]
45 Corollary 15 Assume that two discounted MDPs (E and M) differ only in their transition functions and their cost functions. [sent-221, score-0.384]
46 Let the corresponding transition and cost functions be denoted by $p_E$, $p_M$ and $g_E$, $g_M$, respectively. [sent-222, score-0.211]
47 The value function in E of the deterministic and greedy policy (π) with one-stage lookahead that is based upon $J_M^*$ is denoted by $J_E^\pi$. [sent-224, score-0.249]
48 Another interesting question concerns the effects of environmental changes on the value function of a fixed control policy. [sent-227, score-0.219]
49 Note that the presented theorems are only valid in case of discounted MDPs. [sent-229, score-0.201]
50 Though a large part of MDP-related research studies the expected total discounted cost optimality criterion, in some cases discounting is inappropriate and, therefore, there are alternative optimality approaches as well. [sent-230, score-0.209]
51 Regarding the validity of the results of Section 4 concerning MDPs with the expected average cost minimization objective, we can recall that, in the case of finite MDPs, discounted cost offers a good approximation to the other optimality criterion. [sent-233, score-0.259]
52 More precisely, it can be shown that there exists a large enough α0 < 1 such that ∀ α ∈ (α0 , 1) optimal control policies for the discounted cost problem are also optimal for the average cost problem (Feinberg and Shwartz, 2002). [sent-234, score-0.331]
53 Definition 16 A sequence of MDPs $(M_t)_{t=1}^{\infty}$ is called an ε-MDP with ε > 0 if the MDPs differ only in their transition-probability functions, denoted by $p_t$ for $M_t$, and there exists an MDP with transition function p, called the base MDP, such that $\sup_t \|p - p_t\| \le \varepsilon$. [sent-244, score-0.334]
54 $\limsup_{t \to \infty} \|g - g_t\| \le \delta$. The optimal value functions of the base MDP and of the current MDP at time t (which has transition function $p_t$ and cost function $g_t$) are denoted by $J^*$ and $J_t^*$, respectively. [sent-252, score-0.451]
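Putting the two conditions together, a paraphrased (not verbatim) statement of the (ε, δ)-MDP requirement reads as follows; the norms are the ones chosen for transition and cost functions in the text.

```latex
% A sequence of MDPs (M_t) with transition functions p_t and cost functions g_t is an
% (\varepsilon, \delta)-MDP with base MDP (p, g) if the changes are bounded in the limit:
\limsup_{t \to \infty} \, \| p - p_t \| \le \varepsilon,
\qquad
\limsup_{t \to \infty} \, \| g - g_t \| \le \delta .
```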
55 Lemma 18 Assume that two discounted MDPs, M1 and M2 , differ only in the discount factors, denoted by α1 and α2 . [sent-254, score-0.434]
56 Then, there exists an MDP, denoted by M3 , such that it differs only in the immediate-cost function from M1 , thus its discount factor is α1 , and it has the same optimal value function as M2 . [sent-255, score-0.224]
57 On the other hand, we can notice that changes in the cost function cannot be traced back to changes in the transition function. [sent-257, score-0.322]
58 2 Relaxed Convergence of Stochastic Iterative Algorithms In this section we present a general relaxed convergence theorem for a large class of stochastic iterative algorithms. [sent-268, score-0.26]
59 Later, we will apply this theorem to investigate the convergence properties of value function based reinforcement learning methods in (ε, δ)-MDPs. [sent-269, score-0.231]
60 Assumption 3 For all t, operator $K_t : V \to V$ is a supremum norm contraction mapping with Lipschitz constant $\beta_t < 1$ and with fixed point $V_t^*$. [sent-296, score-0.279]
61 Now, we present a theorem (its proof can be found in the appendix) that shows how the function sequence generated by iteration (6) can converge to an environment of a function. [sent-302, score-0.217]
62 1 A Simple Numerical Example. Consider a one-dimensional stochastic process characterized by the iteration $v_{t+1} = (1 - \gamma_t) v_t + \gamma_t (K_t(v_t) + w_t)$, (7) where $\gamma_t$ is the learning rate and $w_t$ is a noise term. [sent-312, score-0.261]
63 One may think that since each ki attracts the point towards its fixed point, the sequence vt converges to the convex hull of the fixed points. [sent-332, score-0.488]
64 The following argument shows that the sequence $v_t$ cannot get arbitrarily far from the fixed points. [sent-346, score-0.427]
65 Furthermore, let $\beta_0$ be defined as $\beta_0 = \max_i b_i$ and $d_t$ as $d_t = \min_i |v_i^* - v_t|$. [sent-349, score-0.488]
66 Consequently, if vt somehow got farther than φ(β0 , ρ), in the next step it would inevitably be attracted towards the fixed points. [sent-354, score-0.431]
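The one-dimensional example is easy to reproduce numerically; the particular contractions, their fixed points, the learning-rate schedule and the noise level below are illustrative guesses rather than the paper's exact settings.

```python
import random

# Two affine contractions k_i(v) = b_i * (v - v_i*) + v_i*, with Lipschitz constants b_i < 1.
fixed_points = [-1.0, 1.0]
b = [0.9, 0.9]

def k(i, v):
    return b[i] * (v - fixed_points[i]) + fixed_points[i]

rng = random.Random(1)
v = 5.0
trajectory = []
for t in range(1, 5001):
    i = rng.randrange(2)                 # the operator applied at time t may change
    gamma = 1.0 / t                      # diminishing learning rate
    w = rng.gauss(0.0, 0.1)              # noise term w_t
    v = (1.0 - gamma) * v + gamma * (k(i, v) + w)   # iteration (7)
    trajectory.append(v)

# The iterate need not converge to a single point, but it cannot drift
# arbitrarily far from the fixed points of the applied contractions.
print(min(trajectory[-1000:]), max(trajectory[-1000:]))
```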
67 3 Reinforcement Learning in (ε, δ)-MDPs. In the case of finite (ε, δ)-MDPs we can formulate a relaxed convergence theorem for value function based reinforcement learning algorithms, as a corollary of Theorem 20. [sent-359, score-0.283]
68 Assume, for example, that we use the supremum norm, $\|\cdot\|_\infty$, for cost functions and $\|\cdot\|_1$, defined by Equation (5), for transition functions. [sent-363, score-0.241]
69 Notice that as the parameters ε and δ go to zero, we get back to a classical convergence theorem for this kind of stochastic iterative algorithm (still in a slightly generalized form, since $\beta_t$ might still change over time). [sent-369, score-0.208]
70 In the case of (ε, δ)-MDPs a small-stepsize variant of asynchronous value iteration can be defined as $J_{t+1}(x) = (1 - \gamma_t(x)) J_t(x) + \gamma_t(x) (T_t J_t)(x)$, where $T_t$ is the Bellman operator of the current MDP at time t. [sent-377, score-0.238]
71 Assumption 3 follows from the fact that each Tt operator is an α contraction where α is the discount factor. [sent-379, score-0.36]
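A sketch of this small-stepsize asynchronous value iteration in a time-varying MDP; the way the transition matrix is perturbed at each step and the stepsize schedule are hypothetical stand-ins for an (ε, δ)-type environment.

```python
import numpy as np

def bellman(J, P, g, alpha):
    return np.min(g + alpha * P @ J, axis=1)

def perturb(P, eps, rng):
    """Return a slightly perturbed (still stochastic) transition tensor: a stand-in for the changing environment."""
    noise = rng.uniform(0.0, eps, size=P.shape)
    P_t = P + noise
    return P_t / P_t.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
g = np.array([[1.0, 2.0], [0.5, 0.3]])
alpha, eps = 0.9, 0.05
n = P.shape[0]

J = np.zeros(n)
for t in range(1, 2001):
    P_t = perturb(P, eps, rng)       # current MDP at time t
    x = rng.integers(n)              # asynchronous: update one state per step
    gamma = 1.0 / t                  # small stepsizes gamma_t(x)
    TJ = bellman(J, P_t, g, alpha)   # (T_t J_t)
    J[x] = (1.0 - gamma) * J[x] + gamma * TJ[x]
print(J)
```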
72 3 Temporal Difference Learning in (ε, δ)-MDPs. Temporal difference learning, or TD-learning for short, is a policy evaluation algorithm. [sent-394, score-0.212]
73 It aims at finding the corresponding value function J π for a given control policy π (Bertsekas and Tsitsiklis, 1996; Sutton and Barto, 1998). [sent-395, score-0.284]
74 It can also be used for approximating the optimal value function, for example, if we apply it together with the policy iteration algorithm. [sent-396, score-0.248]
75 The proof is based on the observation that iteration (10) can be seen as a Robbins-Monro type stochastic iterative algorithm for finding the fixed point of $J^\pi = H J^\pi$, where H is a contraction mapping with Lipschitz constant α (Bertsekas and Tsitsiklis, 1996). [sent-407, score-0.227]
76 Therefore, we can apply Theorem 20 to achieve a relaxed convergence result for off-line first-visit TD(λ) in changing environments under the same conditions as in the case of ordinary MDPs. [sent-410, score-0.235]
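For illustration, here is a tabular TD(λ) policy-evaluation sketch with accumulating eligibility traces, an online relative of the off-line first-visit variant discussed above; the chain environment, policy and parameters are invented.

```python
import random

def td_lambda(sample_transition, policy, n_states, alpha_discount=0.9,
              lam=0.8, episodes=500, step=0.05, seed=0):
    """Tabular TD(lambda) policy evaluation with accumulating eligibility traces."""
    rng = random.Random(seed)
    J = [0.0] * n_states
    for _ in range(episodes):
        e = [0.0] * n_states              # eligibility traces
        x = rng.randrange(n_states)       # random start state
        for _ in range(50):               # truncated episode
            a = policy(x)
            x_next, cost = sample_transition(x, a, rng)
            delta = cost + alpha_discount * J[x_next] - J[x]   # temporal difference
            e[x] += 1.0
            for s in range(n_states):
                J[s] += step * delta * e[s]
                e[s] *= alpha_discount * lam
            x = x_next
    return J

# Hypothetical 3-state chain: the action is ignored, every step costs 1 except in state 2.
def sample_transition(x, a, rng):
    x_next = min(x + 1, 2) if rng.random() < 0.7 else max(x - 1, 0)
    return x_next, (0.1 if x == 2 else 1.0)

print(td_lambda(sample_transition, policy=lambda x: 0, n_states=3))
```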
77 Operator Bt is time-dependent since, for example, if we model an approximate version of optimistic policy iteration, then in each iteration the control policy changes and, therefore, the update operator changes, as well. [sent-429, score-0.692]
78 Suppose that Assumptions 1-2 hold, each $B_t$ is a contraction, and Π is linear and supremum-norm nonexpansive; then $\Phi(r_t)$ κ-approximates $V^*$ for any initial $r_0$ with $\kappa = 4\rho/(1 - \beta_0)$, where $\rho = \limsup_{t \to \infty} \|V_t^* - V^*\|$. [sent-444, score-0.252]
79 In our randomized grid worlds the agent was taken to a random surrounding state (no matter what action it chose) with probability η and this probability changed after each step. [sent-491, score-0.231]
80 Table 1 was generated using an (optimistic) SARSA algorithm: the current policy was evaluated by SARSA and then (optimistically) improved; more precisely, the greedy policy with respect to the achieved evaluation was calculated. [sent-497, score-0.636]
81 That policy was also soft, namely, it made random explorations with probability 0. [sent-498, score-0.212]
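A sketch of this experimental setup: SARSA evaluation with a soft (greedy-plus-random-exploration) policy in a randomized grid world where, with probability η, the agent is moved to a random neighbouring state regardless of its chosen action. The grid size, η, the exploration rate and the learning parameters are placeholders, not the values behind Table 1.

```python
import random

SIZE, GOAL = 5, (4, 4)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]        # the four axis-parallel moves

def step(state, action, eta, rng):
    """With probability eta the agent is dragged to a random neighbouring state."""
    move = rng.choice(ACTIONS) if rng.random() < eta else action
    x = min(max(state[0] + move[0], 0), SIZE - 1)
    y = min(max(state[1] + move[1], 0), SIZE - 1)
    nxt = (x, y)
    return nxt, (0.0 if nxt == GOAL else 1.0)        # unit cost per step until the goal

def soft_greedy(Q, state, explore, rng):
    if rng.random() < explore:                       # random exploration
        return rng.randrange(len(ACTIONS))
    return min(range(len(ACTIONS)), key=lambda a: Q[state][a])   # greedy w.r.t. costs

def sarsa(eta=0.1, explore=0.05, alpha=0.95, step_size=0.1, episodes=2000, seed=0):
    rng = random.Random(seed)
    Q = {(i, j): [0.0] * len(ACTIONS) for i in range(SIZE) for j in range(SIZE)}
    for _ in range(episodes):
        s = (0, 0)
        a = soft_greedy(Q, s, explore, rng)
        while s != GOAL:
            s2, cost = step(s, ACTIONS[a], eta, rng)
            a2 = soft_greedy(Q, s2, explore, rng)
            # SARSA update on the state-action-cost-state-action quintuple
            Q[s][a] += step_size * (cost + alpha * Q[s2][a2] - Q[s][a])
            s, a = s2, a2
    return Q

Q = sarsa()
print(min(Q[(0, 0)]))   # estimated cost-to-go from the start corner
```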
82 An explanation could be that large changes in the transitions cause the agent to lose control over the events, since it becomes very hard to predict the effects of the actions and, hence, to estimate the expected costs. [sent-654, score-0.246]
83 In the paper we have argued that the optimal value function of a (discounted) MDP Lipschitz continuously depends on the transition-probability function and the immediate-cost function; therefore, small changes in the environment result only in small changes in the optimal value function. [sent-659, score-0.238]
84 A bound for changes in the discount factor was also proven, and it was demonstrated that, in general, this dependence was not Lipschitz continuous. [sent-661, score-0.264]
85 Additionally, it was shown that changes in the discount rate could be traced back to changes in the immediate-cost function. [sent-662, score-0.371]
86 Later, the effects of environmental changes on Q-learning based flexible job-shop scheduling was experimentally studied. [sent-671, score-0.219]
87 These results were illustrated through three classical RL methods: asynchronous value iteration, Q-learning and temporal difference learning policy evaluation. [sent-676, score-0.303]
88 First, analyzing the effects of environmental changes on the value function in case of the expected average cost optimization criterion would be interesting. [sent-680, score-0.197]
89 Theorem 12 Assume that two discounted MDPs differ only in the immediate-cost functions, denoted by $g_1$ and $g_2$. [sent-704, score-0.247]
90 Theorem 13 Assume that two discounted MDPs differ only in the discount factors, denoted by $\alpha_1, \alpha_2 \in [0, 1)$. [sent-712, score-0.434]
91 Proof Let $\pi_i^*$ denote a greedy and deterministic policy based on value function $J_i^*$, where $i \in \{1, 2\}$. [sent-714, score-0.212]
92 Naturally, policy $\pi_i^*$ is optimal if the discount rate is $\alpha_i$ (Theorem 6). [sent-715, score-0.399]
93 Then, let us introduce a deterministic Markovian control policy $\hat{\pi}$ defined as $\hat{\pi}(x) = \pi_1^*(x)$ if $J_1^*(x) \le J_2^*(x)$, and $\hat{\pi}(x) = \pi_2^*(x)$ if $J_2^*(x) < J_1^*(x)$. [sent-716, score-0.284]
94 Lemma 14 Assume that we have two discounted MDPs which differ only in the transition-probability functions or only in the immediate-cost functions or only in the discount factors. [sent-720, score-0.469]
95 Case 3: Assume that the MDPs differ only in the discount rates, denoted by α 1 and α2 . [sent-731, score-0.275]
96 Lemma 18 Assume that two discounted MDPs, M1 and M2 , differ only in the discount factors, denoted by α1 and α2 . [sent-735, score-0.434]
97 Then, there exists an MDP, denoted by M3 , such that it differs only in the immediate-cost function from M1 , thus its discount factor is α1 , and it has the same optimal value function as M2 . [sent-736, score-0.224]
98 During the proof we will apply the solutions of suitable finite-horizon problems; thus, in order to avoid notational confusion, let us denote the optimal state and action value functions of $M_2$ and $M_3$ by $J^*$, $Q^*$ and $\hat{J}^*$, $\hat{Q}^*$, respectively. [sent-751, score-0.219]
99 Assumption 3 For all t, operator $K_t : V \to V$ is a supremum norm contraction mapping with Lipschitz constant $\beta_t < 1$ and with fixed point $V_t^*$. [sent-764, score-0.279]
100 Estimates for perturbations of general discounted Markov control chains. [sent-819, score-0.231]
wordName wordTfidf (topN-words)
[('vt', 0.396), ('mdp', 0.318), ('mdps', 0.249), ('kt', 0.245), ('policy', 0.212), ('discount', 0.187), ('discounted', 0.159), ('aji', 0.149), ('einforcement', 0.149), ('hanging', 0.149), ('nvironments', 0.149), ('onostori', 0.149), ('mins', 0.126), ('reinforcement', 0.124), ('ht', 0.124), ('jt', 0.113), ('lipschitz', 0.109), ('gt', 0.092), ('contraction', 0.09), ('qt', 0.09), ('tsitsiklis', 0.09), ('transition', 0.088), ('bertsekas', 0.087), ('sarsa', 0.084), ('environment', 0.084), ('bellman', 0.083), ('operator', 0.083), ('ft', 0.081), ('wt', 0.081), ('changes', 0.077), ('scheduling', 0.072), ('control', 0.072), ('environments', 0.07), ('environmental', 0.07), ('stepsize', 0.069), ('szita', 0.069), ('action', 0.069), ('supremum', 0.067), ('agent', 0.066), ('theorem', 0.066), ('stochastic', 0.063), ('jn', 0.062), ('rl', 0.062), ('horizon', 0.06), ('feinberg', 0.059), ('kalm', 0.059), ('szepesv', 0.059), ('lim', 0.056), ('earning', 0.056), ('adp', 0.055), ('supt', 0.055), ('state', 0.054), ('relaxed', 0.052), ('differ', 0.051), ('asynchronous', 0.05), ('monostori', 0.05), ('cost', 0.05), ('markovian', 0.048), ('sutton', 0.046), ('dt', 0.046), ('td', 0.045), ('shwartz', 0.045), ('tt', 0.045), ('changing', 0.043), ('xt', 0.043), ('theorems', 0.042), ('grid', 0.042), ('convergence', 0.041), ('temporal', 0.041), ('pathological', 0.04), ('xtk', 0.04), ('norm', 0.039), ('resource', 0.039), ('ji', 0.039), ('job', 0.039), ('costs', 0.038), ('iterative', 0.038), ('denoted', 0.037), ('iteration', 0.036), ('functions', 0.036), ('pt', 0.036), ('bt', 0.035), ('farther', 0.035), ('rt', 0.034), ('barto', 0.034), ('sgn', 0.034), ('stepsizes', 0.034), ('littman', 0.034), ('je', 0.034), ('xm', 0.032), ('hull', 0.032), ('arg', 0.032), ('actions', 0.031), ('sequence', 0.031), ('traced', 0.03), ('bal', 0.03), ('csan', 0.03), ('obsolete', 0.03), ('ordinary', 0.029), ('ki', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.000001 95 jmlr-2008-Value Function Based Reinforcement Learning in Changing Markovian Environments
Author: Balázs Csanád Csáji, László Monostori
Abstract: The paper investigates the possibility of applying value function based reinforcement learning (RL) methods in cases when the environment may change over time. First, theorems are presented which show that the optimal value function of a discounted Markov decision process (MDP) Lipschitz continuously depends on the immediate-cost function and the transition-probability function. Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of (ε, δ)-MDPs is introduced, which is a generalization of MDPs and ε-MDPs. In this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must be bounded in the limit. Then, learning algorithms in changing environments are analyzed. A general relaxed convergence theorem for stochastic iterative algorithms is presented. We also demonstrate the results through three classical RL methods: asynchronous value iteration, Q-learning and temporal difference learning. Finally, some numerical experiments concerning changing environments are presented. Keywords: Markov decision processes, reinforcement learning, changing environments, (ε, δ)MDPs, value function bounds, stochastic iterative algorithms
2 0.23810366 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
Author: Rémi Munos, Csaba Szepesvári
Abstract: In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrary good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted L p -norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is “aligned” with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings. Keywords: fitted value iteration, discounted Markovian decision processes, generative model, reinforcement learning, supervised learning, regression, Pollard’s inequality, statistical learning theory, optimal control
3 0.10349227 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
Author: Yonatan Amit, Shai Shalev-Shwartz, Yoram Singer
Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online predictor by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution of the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with synthetic data and with multiclass text categorization tasks. Keywords: online learning, parallel computation, mistake bounds, structured prediction
Author: Avraham Bab, Ronen I. Brafman
Abstract: Multi Agent Reinforcement Learning (MARL) has received continually growing attention in the past decade. Many algorithms that vary in their approaches to the different subtasks of MARL have been developed. However, the theoretical convergence results for these algorithms do not give a clue as to their practical performance nor supply insights to the dynamics of the learning process itself. This work is a comprehensive empirical study conducted on MGS, a simulation system developed for this purpose. It surveys the important algorithms in the field, demonstrates the strengths and weaknesses of the different approaches to MARL through application of FriendQ, OAL, WoLF, FoeQ, Rmax, and other algorithms to a variety of fully cooperative and fully competitive domains in self and heterogeneous play, and supplies an informal analysis of the resulting learning processes. The results can aid in the design of new learning algorithms, in matching existing algorithms to specific tasks, and may guide further research and formal analysis of the learning processes. Keywords: reinforcement learning, multi-agent reinforcement learning, stochastic games
5 0.089740746 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning
Author: Sungwook Yoon, Alan Fern, Robert Givan
Abstract: A number of today’s state-of-the-art planners are based on forward state-space search. The impressive performance can be attributed to progress in computing domain independent heuristics that perform well across many domains. However, it is easy to find domains where such heuristics provide poor guidance, leading to planning failure. Motivated by such failures, the focus of this paper is to investigate mechanisms for learning domain-specific knowledge to better control forward search in a given domain. While there has been a large body of work on inductive learning of control knowledge for AI planning, there is a void of work aimed at forward-state-space search. One reason for this may be that it is challenging to specify a knowledge representation for compactly representing important concepts across a wide range of domains. One of the main contributions of this work is to introduce a novel feature space for representing such control knowledge. The key idea is to define features in terms of information computed via relaxed plan extraction, which has been a major source of success for non-learning planners. This gives a new way of leveraging relaxed planning techniques in the context of learning. Using this feature space, we describe three forms of control knowledge—reactive policies (decision list rules and measures of progress) and linear heuristics—and show how to learn them and incorporate them into forward state-space search. Our empirical results show that our approaches are able to surpass state-of-the-art nonlearning planners across a wide range of planning competition domains. Keywords: planning, machine learning, knowledge representation, search
6 0.073963262 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses
7 0.072738096 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management
8 0.069995798 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
9 0.065989934 90 jmlr-2008-Theoretical Advantages of Lenient Learners: An Evolutionary Game Theoretic Perspective
10 0.064408757 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
11 0.050205626 80 jmlr-2008-Ranking Individuals by Group Comparisons
12 0.049478874 52 jmlr-2008-Learning from Multiple Sources
13 0.046913434 37 jmlr-2008-Forecasting Web Page Views: Methods and Observations
14 0.042221934 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
15 0.036909446 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
16 0.036167145 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
17 0.035967678 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
18 0.035445694 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
19 0.035008579 53 jmlr-2008-Learning to Combine Motor Primitives Via Greedy Additive Regression
20 0.032727126 92 jmlr-2008-Universal Multi-Task Kernels
topicId topicWeight
[(0, 0.194), (1, -0.059), (2, -0.09), (3, -0.075), (4, -0.292), (5, -0.396), (6, -0.191), (7, -0.06), (8, 0.01), (9, 0.005), (10, 0.007), (11, -0.136), (12, 0.049), (13, -0.023), (14, -0.159), (15, 0.013), (16, -0.147), (17, -0.102), (18, 0.041), (19, -0.044), (20, -0.014), (21, -0.174), (22, 0.066), (23, -0.024), (24, -0.1), (25, 0.093), (26, 0.009), (27, 0.012), (28, -0.18), (29, 0.029), (30, 0.026), (31, 0.061), (32, -0.039), (33, 0.059), (34, 0.088), (35, 0.046), (36, -0.103), (37, -0.002), (38, -0.05), (39, -0.032), (40, 0.006), (41, 0.079), (42, 0.039), (43, -0.008), (44, 0.015), (45, -0.063), (46, 0.117), (47, -0.079), (48, 0.052), (49, -0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.96193469 95 jmlr-2008-Value Function Based Reinforcement Learning in Changing Markovian Environments
Author: Balázs Csanád Csáji, László Monostori
Abstract: The paper investigates the possibility of applying value function based reinforcement learning (RL) methods in cases when the environment may change over time. First, theorems are presented which show that the optimal value function of a discounted Markov decision process (MDP) Lipschitz continuously depends on the immediate-cost function and the transition-probability function. Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of (ε, δ)-MDPs is introduced, which is a generalization of MDPs and ε-MDPs. In this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must be bounded in the limit. Then, learning algorithms in changing environments are analyzed. A general relaxed convergence theorem for stochastic iterative algorithms is presented. We also demonstrate the results through three classical RL methods: asynchronous value iteration, Q-learning and temporal difference learning. Finally, some numerical experiments concerning changing environments are presented. Keywords: Markov decision processes, reinforcement learning, changing environments, (ε, δ)MDPs, value function bounds, stochastic iterative algorithms
2 0.76066577 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
Author: Rémi Munos, Csaba Szepesvári
Abstract: In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrary good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted L p -norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is “aligned” with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings. Keywords: fitted value iteration, discounted Markovian decision processes, generative model, reinforcement learning, supervised learning, regression, Pollard’s inequality, statistical learning theory, optimal control
3 0.44828382 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
Author: Yonatan Amit, Shai Shalev-Shwartz, Yoram Singer
Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online predictor by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution of the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with synthetic data and with multiclass text categorization tasks. Keywords: online learning, parallel computation, mistake bounds, structured prediction
4 0.39082047 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning
Author: Sungwook Yoon, Alan Fern, Robert Givan
Abstract: A number of today’s state-of-the-art planners are based on forward state-space search. The impressive performance can be attributed to progress in computing domain independent heuristics that perform well across many domains. However, it is easy to find domains where such heuristics provide poor guidance, leading to planning failure. Motivated by such failures, the focus of this paper is to investigate mechanisms for learning domain-specific knowledge to better control forward search in a given domain. While there has been a large body of work on inductive learning of control knowledge for AI planning, there is a void of work aimed at forward-state-space search. One reason for this may be that it is challenging to specify a knowledge representation for compactly representing important concepts across a wide range of domains. One of the main contributions of this work is to introduce a novel feature space for representing such control knowledge. The key idea is to define features in terms of information computed via relaxed plan extraction, which has been a major source of success for non-learning planners. This gives a new way of leveraging relaxed planning techniques in the context of learning. Using this feature space, we describe three forms of control knowledge—reactive policies (decision list rules and measures of progress) and linear heuristics—and show how to learn them and incorporate them into forward state-space search. Our empirical results show that our approaches are able to surpass state-of-the-art nonlearning planners across a wide range of planning competition domains. Keywords: planning, machine learning, knowledge representation, search
Author: Avraham Bab, Ronen I. Brafman
Abstract: Multi Agent Reinforcement Learning (MARL) has received continually growing attention in the past decade. Many algorithms that vary in their approaches to the different subtasks of MARL have been developed. However, the theoretical convergence results for these algorithms do not give a clue as to their practical performance nor supply insights to the dynamics of the learning process itself. This work is a comprehensive empirical study conducted on MGS, a simulation system developed for this purpose. It surveys the important algorithms in the field, demonstrates the strengths and weaknesses of the different approaches to MARL through application of FriendQ, OAL, WoLF, FoeQ, Rmax, and other algorithms to a variety of fully cooperative and fully competitive domains in self and heterogeneous play, and supplies an informal analysis of the resulting learning processes. The results can aid in the design of new learning algorithms, in matching existing algorithms to specific tasks, and may guide further research and formal analysis of the learning processes. Keywords: reinforcement learning, multi-agent reinforcement learning, stochastic games
6 0.30927131 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
7 0.29705629 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
8 0.28999186 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses
9 0.28944981 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management
10 0.20801939 37 jmlr-2008-Forecasting Web Page Views: Methods and Observations
11 0.20773335 80 jmlr-2008-Ranking Individuals by Group Comparisons
12 0.16991095 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
13 0.16680761 52 jmlr-2008-Learning from Multiple Sources
14 0.15962668 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
15 0.15172829 53 jmlr-2008-Learning to Combine Motor Primitives Via Greedy Additive Regression
16 0.14832622 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
17 0.14341339 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
18 0.13751444 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
19 0.13745639 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
20 0.13169909 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
topicId topicWeight
[(0, 0.023), (40, 0.017), (54, 0.618), (58, 0.028), (66, 0.032), (88, 0.075), (92, 0.037), (94, 0.041), (99, 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.960329 95 jmlr-2008-Value Function Based Reinforcement Learning in Changing Markovian Environments
Author: Balázs Csanád Csáji, László Monostori
Abstract: The paper investigates the possibility of applying value function based reinforcement learning (RL) methods in cases when the environment may change over time. First, theorems are presented which show that the optimal value function of a discounted Markov decision process (MDP) Lipschitz continuously depends on the immediate-cost function and the transition-probability function. Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of (ε, δ)-MDPs is introduced, which is a generalization of MDPs and ε-MDPs. In this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must be bounded in the limit. Then, learning algorithms in changing environments are analyzed. A general relaxed convergence theorem for stochastic iterative algorithms is presented. We also demonstrate the results through three classical RL methods: asynchronous value iteration, Q-learning and temporal difference learning. Finally, some numerical experiments concerning changing environments are presented. Keywords: Markov decision processes, reinforcement learning, changing environments, (ε, δ)MDPs, value function bounds, stochastic iterative algorithms
2 0.89634627 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
Author: Vojtěch Franc, Bogdan Savchynskyy
Abstract: The max-sum classifier predicts n-tuple of labels from n-tuple of observable variables by maximizing a sum of quality functions defined over neighbouring pairs of labels and observable variables. Predicting labels as MAP assignments of a Random Markov Field is a particular example of the max-sum classifier. Learning parameters of the max-sum classifier is a challenging problem because even computing the response of such classifier is NP-complete in general. Estimating parameters using the Maximum Likelihood approach is feasible only for a subclass of max-sum classifiers with an acyclic structure of neighbouring pairs. Recently, the discriminative methods represented by the perceptron and the Support Vector Machines, originally designed for binary linear classifiers, have been extended for learning some subclasses of the max-sum classifier. Besides the max-sum classifiers with the acyclic neighbouring structure, it has been shown that the discriminative learning is possible even with arbitrary neighbouring structure provided the quality functions fulfill some additional constraints. In this article, we extend the discriminative approach to other three classes of max-sum classifiers with an arbitrary neighbourhood structure. We derive learning algorithms for two subclasses of max-sum classifiers whose response can be computed in polynomial time: (i) the max-sum classifiers with supermodular quality functions and (ii) the max-sum classifiers whose response can be computed exactly by a linear programming relaxation. Moreover, we show that the learning problem can be approximately solved even for a general max-sum classifier. Keywords: max-xum classifier, hidden Markov networks, support vector machines
3 0.83148581 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting
Author: David Mease, Abraham Wyner
Abstract: The statistical perspective on boosting algorithms focuses on optimization, drawing parallels with maximum likelihood estimation for logistic regression. In this paper we present empirical evidence that raises questions about this view. Although the statistical perspective provides a theoretical framework within which it is possible to derive theorems and create new algorithms in general contexts, we show that there remain many unanswered important questions. Furthermore, we provide examples that reveal crucial flaws in the many practical suggestions and new methods that are derived from the statistical view. We perform carefully designed experiments using simple simulation models to illustrate some of these flaws and their practical consequences. Keywords: boosting algorithms, LogitBoost, AdaBoost
4 0.61197233 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
Author: Rémi Munos, Csaba Szepesvári
Abstract: In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrary good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted L p -norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is “aligned” with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings. Keywords: fitted value iteration, discounted Markovian decision processes, generative model, reinforcement learning, supervised learning, regression, Pollard’s inequality, statistical learning theory, optimal control
5 0.50450802 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
Author: Yonatan Amit, Shai Shalev-Shwartz, Yoram Singer
Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online predictor by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution of the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with synthetic data and with multiclass text categorization tasks. Keywords: online learning, parallel computation, mistake bounds, structured prediction
6 0.49389341 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
7 0.45494014 19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression
8 0.44263968 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management
10 0.41225359 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
11 0.41113004 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
12 0.41095775 83 jmlr-2008-Robust Submodular Observation Selection
13 0.40433496 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
14 0.40175211 56 jmlr-2008-Magic Moments for Structured Output Prediction
15 0.39810577 86 jmlr-2008-SimpleMKL
16 0.39015377 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
18 0.38135317 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
19 0.37877417 28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines
20 0.37523428 9 jmlr-2008-Active Learning by Spherical Subdivision