jmlr jmlr2008 jmlr2008-95 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Balázs Csanád Csáji, László Monostori
Abstract: The paper investigates the possibility of applying value function based reinforcement learning (RL) methods in cases when the environment may change over time. First, theorems are presented which show that the optimal value function of a discounted Markov decision process (MDP) Lipschitz continuously depends on the immediate-cost function and the transition-probability function. Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of (ε, δ)-MDPs is introduced, which is a generalization of MDPs and ε-MDPs. In this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must be bounded in the limit. Then, learning algorithms in changing environments are analyzed. A general relaxed convergence theorem for stochastic iterative algorithms is presented. We also demonstrate the results through three classical RL methods: asynchronous value iteration, Q-learning and temporal difference learning. Finally, some numerical experiments concerning changing environments are presented. Keywords: Markov decision processes, reinforcement learning, changing environments, (ε, δ)-MDPs, value function bounds, stochastic iterative algorithms
Reference: text
sentIndex sentText sentNum sentScore
1 First, theorems are presented which show that the optimal value function of a discounted Markov decision process (MDP) Lipschitz continuously depends on the immediate-cost function and the transition-probability function. [sent-6, score-0.201]
2 Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. [sent-7, score-0.187]
3 In this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must be bounded in the limit. [sent-9, score-0.299]
4 A general relaxed convergence theorem for stochastic iterative algorithms is presented. [sent-11, score-0.26]
5 Keywords: Markov decision processes, reinforcement learning, changing environments, (ε, δ)MDPs, value function bounds, stochastic iterative algorithms 1. [sent-14, score-0.268]
6 Several solution methods are known, for example, from the field of [neuro-]dynamic programming (NDP) or reinforcement learning (RL), which compute or approximate the optimal control policy of an MDP. [sent-18, score-0.408]
7 The dynamics of (Markovian) control problems can often be formulated as follows: $x_{t+1} = f(x_t, a_t, w_t)$, (1) where $x_t$ is the state of the system at time $t \in \mathbb{N}$, $a_t$ is a control action and $w_t$ is some disturbance. [sent-26, score-0.472]
8 There is also a cost function $g(x_t, a_t)$ and the aim is to find an optimal control policy that minimizes the [discounted] costs over time (the next section will contain the basic definitions). [sent-27, score-0.372]
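As an illustration of the dynamics in (1) and the discounted cost objective, here is a minimal rollout sketch; the transition function, cost function, policy and parameters are hypothetical placeholders, not taken from the paper.

```python
import random

def simulate_discounted_cost(f, g, policy, x0, alpha=0.95, horizon=200, seed=0):
    """Roll out x_{t+1} = f(x_t, a_t, w_t) under a policy and sum discounted costs g(x_t, a_t)."""
    rng = random.Random(seed)
    x, total, discount = x0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(x)                # control action chosen in state x
        w = rng.gauss(0.0, 1.0)      # disturbance w_t
        total += discount * g(x, a)  # accumulate discounted one-step cost
        x = f(x, a, w)               # state transition
        discount *= alpha
    return total

# Hypothetical one-dimensional system: drift toward zero plus noise.
f = lambda x, a, w: 0.9 * x + a + 0.1 * w
g = lambda x, a: x * x + 0.01 * a * a     # quadratic state and action cost
policy = lambda x: -0.5 * x               # a simple proportional controller
print(simulate_discounted_cost(f, g, policy, x0=1.0))
```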
9 In many applications the calculation of a control policy should be fast and, additionally, environmental changes should also be taken into account. [sent-28, score-0.431]
10 In most control applications, the system uses a model of the environment during the computation of a control policy. [sent-30, score-0.356]
11 The changing of the dynamics can also be modeled as an MDP; however, including environmental changes as a higher-level MDP very likely leads to problems which do not have any practically efficient solution methods. [sent-34, score-0.19]
12 The paper argues that if the model was “close” to the environment, then a “good” policy based on the model cannot be arbitrarily “wrong” from the viewpoint of the environment and, moreover, “slight” changes in the environment result only in “slight” changes in the optimal cost-to-go function. [sent-35, score-0.534]
13 Then, we apply that framework to deduce an approximate convergence theorem for time-dependent stochastic iterative algorithms. [sent-40, score-0.208]
14 Later, with the help of this general theorem, we show relaxed convergence properties (more precisely, κ-approximation) for value function based reinforcement learning methods working in (ε, δ)-MDPs. [sent-41, score-0.217]
15 We also present value function bounds (Theorem 13) for the case of changes in the discount factor and demonstrate that this dependence is not Lipschitz continuous. [sent-46, score-0.264]
16 In this model the transition function and the cost function may change over time, provided that the accumulated changes remain bounded in the limit. [sent-51, score-0.215]
17 We show (Lemma 18) that potential changes in the discount factor can be incorporated into the immediate-cost function, thus, we do not have to consider discount factor changes. [sent-52, score-0.451]
18 As a corollary, we get an approximation theorem for value function based reinforcement learning methods in (ε, δ)-MDPs (Corollary 21). [sent-56, score-0.19]
19 Regarding learning, we illustrate the effects of environmental changes through two problems: scheduling and grid world. [sent-64, score-0.261]
20 The agent receives information about the state of the environment, x; at each state x the agent is allowed to choose an action a ∈ A(x). [sent-75, score-0.393]
21 After the action is selected, the environment moves to the next state according to the probability distribution p(x, a) and the decision-maker collects its one-step cost, g(x, a). [sent-76, score-0.207]
22 Definition 2 A (stationary, Markovian) control policy determines the action to take in each state. [sent-78, score-0.353]
23 Definition 3 The value or cost-to-go function of a policy π is a function from states to costs, $J^\pi : X \to \mathbb{R}$. [sent-84, score-0.212]
24 A control policy is (uniformly) optimal if its value function is less than or equal to the value functions of all other control policies. [sent-89, score-0.356]
25 There always exists at least one optimal policy (Sutton and Barto, 1998). [sent-90, score-0.212]
26 It is known that the Bellman operator is a supremum norm contraction with Lipschitz constant α. [sent-100, score-0.279]
27 If we consider stochastic shortest path (SSP) problems, which arise when the MDP has an absorbing terminal (goal) state, then the Bellman operator becomes a pseudo-contraction in the weighted supremum norm (Bertsekas and Tsitsiklis, 1996). [sent-101, score-0.252]
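The contraction property is what makes plain value iteration work; the sketch below iterates the Bellman operator of a small discounted MDP until the supremum-norm change is negligible. The transition probabilities and costs are made-up numbers, used only for illustration.

```python
import numpy as np

def bellman_operator(J, P, g, alpha):
    """(TJ)(x) = min_a [ g(x,a) + alpha * sum_y p(y|x,a) J(y) ]."""
    # P has shape (n_states, n_actions, n_states); g has shape (n_states, n_actions).
    return np.min(g + alpha * P @ J, axis=1)

def value_iteration(P, g, alpha=0.9, tol=1e-8):
    J = np.zeros(P.shape[0])
    while True:
        J_next = bellman_operator(J, P, g, alpha)
        if np.max(np.abs(J_next - J)) < tol:   # sup-norm stopping rule
            return J_next
        J = J_next

# Made-up 2-state, 2-action MDP.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
g = np.array([[1.0, 2.0],
              [0.5, 0.3]])
print(value_iteration(P, g))
```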
28 From a given value function J, it is straightforward to get a policy, for example, by applying a greedy and deterministic policy (w.r.t. J). [sent-102, score-0.212]
29 If J is “close” to $J^*$, then the greedy policy with one-stage lookahead based on J will also be “close” to an optimal policy, as was proven by Bertsekas and Tsitsiklis (1996): Theorem 6 Let M be a discounted MDP and let J be an arbitrary value function. [sent-109, score-0.371]
30 The value function of the greedy policy based on J is denoted by J π . [sent-110, score-0.249]
31 Consequently, if we could obtain a good approximation of the optimal value function, then we would immediately have a good control policy as well, for example, the greedy policy with respect to our approximate value function. [sent-113, score-0.284]
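A minimal sketch of extracting the deterministic one-stage-lookahead greedy policy from an approximate value function J, in the spirit of Theorem 6; the MDP arrays are the same made-up ones as in the value-iteration sketch above.

```python
import numpy as np

def greedy_policy(J, P, g, alpha):
    """pi(x) = argmin_a [ g(x,a) + alpha * sum_y p(y|x,a) J(y) ]  (one-stage lookahead)."""
    return np.argmin(g + alpha * P @ J, axis=1)

# Reusing the made-up MDP from the previous sketch:
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
g = np.array([[1.0, 2.0],
              [0.5, 0.3]])
J_approx = np.array([3.0, 2.0])   # any approximate value function
print(greedy_policy(J_approx, P, g, alpha=0.9))
```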
32 $\limsup_{t \to \infty} \|X_t - X\| \le \kappa$ almost surely. (3) Hence, the “meaning” of this definition is that the sequence $X_t$ converges almost surely to an environment of X and the radius of this environment is less than or equal to a given constant κ. [sent-125, score-0.199]
33 Consider, for example, the SARSA (state-action-reward-state-action) algorithm, which is a model-free policy evaluation method. [sent-130, score-0.212]
34 Definition 8 We say that the operator sequence $H_t$ κ-approximates operator H : V → V at V ∈ V if for any initial $V_0$ ∈ V the sequence $V_{t+1} = H_t(V_t, V)$ κ-approximates HV. [sent-134, score-0.228]
35 Theorem 10 Assume that two discounted MDPs differ only in their transition functions, denoted by $p_1$ and $p_2$. [sent-170, score-0.335]
36 Let the corresponding optimal value functions be $J_1^*$ and $J_2^*$; then $\|J_1^* - J_2^*\|_\infty \le \dfrac{n \alpha \, \|g\|_\infty \, \|p_1 - p_2\|_\infty}{(1 - \alpha)^2}$; recall that n is the size of the state space and $\alpha \in [0, 1)$ is the discount rate. [sent-171, score-0.277]
37 Theorem 12 Assume that two discounted MDPs differ only in the immediate-cost functions, $g_1$ and $g_2$. [sent-192, score-0.21]
38 3 Changes in the Discount Factor. The following theorem shows that the change of the value function can also be estimated when the discount rate changes (all proofs can be found in the appendix). [sent-195, score-0.33]
39 Theorem 13 Assume that two discounted MDPs differ only in the discount factors, denoted by $\alpha_1, \alpha_2 \in [0, 1)$. [sent-196, score-0.434]
40 Suppose that the MDP has discount factor $\alpha_1 = 0$; thus, $J_1^*(x) = 1$. [sent-201, score-0.187]
41 Now, if we change the discount rate to $\alpha_2 \in (0, 1)$, then $|\alpha_1 - \alpha_2| < 1$ but $\|J_1^* - J_2^*\|_\infty$ could be arbitrarily large, since $J_2^*(x) \to \infty$ as $\alpha_2 \to 1$. [sent-202, score-0.187]
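For concreteness, the counterexample can be written out as a short calculation; the unit immediate cost $g \equiv 1$ below is an assumption made only to match $J_1^*(x) = 1$ at $\alpha_1 = 0$.

```latex
% Assume g(x,a) \equiv 1. With discount factor \alpha_1 = 0 the optimal cost-to-go is
J_1^*(x) = 1,
% while changing only the discount factor to \alpha_2 \in (0,1) yields the geometric series
J_2^*(x) = \sum_{t=0}^{\infty} \alpha_2^{\,t} \cdot 1 = \frac{1}{1-\alpha_2}
  \;\longrightarrow\; \infty \quad \text{as } \alpha_2 \to 1,
% so \|J_1^* - J_2^*\|_\infty cannot be bounded in terms of |\alpha_1 - \alpha_2| alone.
```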
42 Lemma 14 Assume that we have two discounted MDPs which differ only in the transition-probability functions or only in the immediate-cost functions or only in the discount factors. [sent-211, score-0.469]
43 5 Further Remarks on Inaccurate Models In this section we saw that the optimal value function of a discounted MDP depends smoothly on the transition function, the cost function and the discount rate. [sent-215, score-0.484]
44 Later, we will see that changes in the discount rate can be traced back to changes in the cost function (Lemma 18), therefore, it is sufficient to consider transition and cost changes. [sent-219, score-0.559]
45 Corollary 15 Assume that two discounted MDPs (E and M) differ only in their transition functions and their cost functions. [sent-221, score-0.384]
46 Let the corresponding transition and cost functions be denoted by $p_E$, $p_M$ and $g_E$, $g_M$, respectively. [sent-222, score-0.211]
47 The value function in E of the deterministic and greedy policy (π) with one-stage lookahead that is based upon $J_M^*$ is denoted by $J_E^\pi$. [sent-224, score-0.249]
48 Another interesting question concerns the effects of environmental changes on the value function of a fixed control policy. [sent-227, score-0.219]
49 Note that the presented theorems are only valid in case of discounted MDPs. [sent-229, score-0.201]
50 Though a large part of MDP-related research studies the expected total discounted cost optimality criterion, in some cases discounting is inappropriate and, therefore, there are alternative optimality approaches as well. [sent-230, score-0.209]
51 Regarding the validity of the results of Section 4 concerning MDPs with the expected average cost minimization objective, we can recall that, in the case of finite MDPs, discounted cost offers a good approximation to the other optimality criterion. [sent-233, score-0.259]
52 More precisely, it can be shown that there exists a large enough α0 < 1 such that ∀ α ∈ (α0 , 1) optimal control policies for the discounted cost problem are also optimal for the average cost problem (Feinberg and Shwartz, 2002). [sent-234, score-0.331]
53 Definition 16 A sequence of MDPs $(M_t)_{t=1}^{\infty}$ is called an ε-MDP with ε > 0 if the MDPs differ only in their transition-probability functions, denoted by $p_t$ for $M_t$, and there exists an MDP with transition function p, called the base MDP, such that $\sup_t \|p - p_t\| \le \varepsilon$. [sent-244, score-0.334]
54 $\limsup_{t \to \infty} \|g - g_t\| \le \delta$. The optimal value functions of the base MDP and of the current MDP at time t (which has transition function $p_t$ and cost function $g_t$) are denoted by $J^*$ and $J_t^*$, respectively. [sent-252, score-0.451]
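Putting the two conditions together, a paraphrased (not verbatim) statement of the (ε, δ)-MDP requirement reads as follows; the norms are the ones chosen for transition and cost functions in the text.

```latex
% A sequence of MDPs (M_t) with transition functions p_t and cost functions g_t is an
% (\varepsilon, \delta)-MDP with base MDP (p, g) if the changes are bounded in the limit:
\limsup_{t \to \infty} \, \| p - p_t \| \le \varepsilon,
\qquad
\limsup_{t \to \infty} \, \| g - g_t \| \le \delta .
```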
55 Lemma 18 Assume that two discounted MDPs, M1 and M2 , differ only in the discount factors, denoted by α1 and α2 . [sent-254, score-0.434]
56 Then, there exists an MDP, denoted by M3 , such that it differs only in the immediate-cost function from M1 , thus its discount factor is α1 , and it has the same optimal value function as M2 . [sent-255, score-0.224]
57 On the other hand, we can notice that changes in the cost function cannot be traced back to changes in the transition function. [sent-257, score-0.322]
58 2 Relaxed Convergence of Stochastic Iterative Algorithms In this section we present a general relaxed convergence theorem for a large class of stochastic iterative algorithms. [sent-268, score-0.26]
59 Later, we will apply this theorem to investigate the convergence properties of value function based reinforcement learning methods in (ε, δ)-MDPs. [sent-269, score-0.231]
60 Assumption 3 For all t, operator $K_t : V \to V$ is a supremum norm contraction mapping with Lipschitz constant $\beta_t < 1$ and with fixed point $V_t^*$. [sent-296, score-0.279]
61 Now, we present a theorem (its proof can be found in the appendix) that shows how the function sequence generated by iteration (6) can converge to an environment of a function. [sent-302, score-0.217]
62 1 A Simple Numerical Example. Consider a one-dimensional stochastic process characterized by the iteration $v_{t+1} = (1 - \gamma_t) v_t + \gamma_t (K_t(v_t) + w_t)$, (7) where $\gamma_t$ is the learning rate and $w_t$ is a noise term. [sent-312, score-0.261]
63 One may think that since each ki attracts the point towards its fixed point, the sequence vt converges to the convex hull of the fixed points. [sent-332, score-0.488]
64 The following argument shows that the sequence $v_t$ cannot get arbitrarily far from the fixed points. [sent-346, score-0.427]
65 Furthermore, let $\beta_0$ be defined as $\beta_0 = \max_i b_i$ and $d_t$ as $d_t = \min_i |v_i^* - v_t|$. [sent-349, score-0.488]
66 Consequently, if vt somehow got farther than φ(β0 , ρ), in the next step it would inevitably be attracted towards the fixed points. [sent-354, score-0.431]
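The one-dimensional example is easy to reproduce numerically; the particular contractions, their fixed points, the learning-rate schedule and the noise level below are illustrative guesses rather than the paper's exact settings.

```python
import random

# Two affine contractions k_i(v) = b_i * (v - v_i*) + v_i*, with Lipschitz constants b_i < 1.
fixed_points = [-1.0, 1.0]
b = [0.9, 0.9]

def k(i, v):
    return b[i] * (v - fixed_points[i]) + fixed_points[i]

rng = random.Random(1)
v = 5.0
trajectory = []
for t in range(1, 5001):
    i = rng.randrange(2)                 # the operator applied at time t may change
    gamma = 1.0 / t                      # diminishing learning rate
    w = rng.gauss(0.0, 0.1)              # noise term w_t
    v = (1.0 - gamma) * v + gamma * (k(i, v) + w)   # iteration (7)
    trajectory.append(v)

# The iterate need not converge to a single point, but it cannot drift
# arbitrarily far from the fixed points of the applied contractions.
print(min(trajectory[-1000:]), max(trajectory[-1000:]))
```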
67 3 Reinforcement Learning in (ε, δ)-MDPs. In the case of finite (ε, δ)-MDPs we can formulate a relaxed convergence theorem for value function based reinforcement learning algorithms, as a corollary of Theorem 20. [sent-359, score-0.283]
68 Assume, for example, that we use the supremum norm, $\|\cdot\|_\infty$, for cost functions and $\|\cdot\|_1$, defined by Equation (5), for transition functions. [sent-363, score-0.241]
69 Notice that as the parameters ε and δ go to zero, we get back to a classical convergence theorem for this kind of stochastic iterative algorithm (still in a slightly generalized form, since $\beta_t$ might still change over time). [sent-369, score-0.208]
70 In the case of (ε, δ)-MDPs a small-stepsize variant of asynchronous value iteration can be defined as $J_{t+1}(x) = (1 - \gamma_t(x)) J_t(x) + \gamma_t(x) (T_t J_t)(x)$, where $T_t$ is the Bellman operator of the current MDP at time t. [sent-377, score-0.238]
71 Assumption 3 follows from the fact that each Tt operator is an α contraction where α is the discount factor. [sent-379, score-0.36]
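A sketch of this small-stepsize asynchronous value iteration in a time-varying MDP; the way the transition matrix is perturbed at each step and the stepsize schedule are hypothetical stand-ins for an (ε, δ)-type environment.

```python
import numpy as np

def bellman(J, P, g, alpha):
    return np.min(g + alpha * P @ J, axis=1)

def perturb(P, eps, rng):
    """Return a slightly perturbed (still stochastic) transition tensor: a stand-in for the changing environment."""
    noise = rng.uniform(0.0, eps, size=P.shape)
    P_t = P + noise
    return P_t / P_t.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])
g = np.array([[1.0, 2.0], [0.5, 0.3]])
alpha, eps = 0.9, 0.05
n = P.shape[0]

J = np.zeros(n)
for t in range(1, 2001):
    P_t = perturb(P, eps, rng)       # current MDP at time t
    x = rng.integers(n)              # asynchronous: update one state per step
    gamma = 1.0 / t                  # small stepsizes gamma_t(x)
    TJ = bellman(J, P_t, g, alpha)   # (T_t J_t)
    J[x] = (1.0 - gamma) * J[x] + gamma * TJ[x]
print(J)
```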
72 3 Temporal Difference Learning in (ε, δ)-MDPs. Temporal difference learning, or TD-learning for short, is a policy evaluation algorithm. [sent-394, score-0.212]
73 It aims at finding the corresponding value function J π for a given control policy π (Bertsekas and Tsitsiklis, 1996; Sutton and Barto, 1998). [sent-395, score-0.284]
74 It can also be used for approximating the optimal value function, for example, if we apply it together with the policy iteration algorithm. [sent-396, score-0.248]
75 The proof is based on the observation that iteration (10) can be seen as a Robbins-Monro type stochastic iterative algorithm for finding the fixed point of $J^\pi = H J^\pi$, where H is a contraction mapping with Lipschitz constant α (Bertsekas and Tsitsiklis, 1996). [sent-407, score-0.227]
76 Therefore, we can apply Theorem 20 to achieve a relaxed convergence result for off-line first-visit TD(λ) in changing environments under the same conditions as in the case of ordinary MDPs. [sent-410, score-0.235]
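For illustration, here is a tabular TD(λ) policy-evaluation sketch with accumulating eligibility traces, an online relative of the off-line first-visit variant discussed above; the chain environment, policy and parameters are invented.

```python
import random

def td_lambda(sample_transition, policy, n_states, alpha_discount=0.9,
              lam=0.8, episodes=500, step=0.05, seed=0):
    """Tabular TD(lambda) policy evaluation with accumulating eligibility traces."""
    rng = random.Random(seed)
    J = [0.0] * n_states
    for _ in range(episodes):
        e = [0.0] * n_states              # eligibility traces
        x = rng.randrange(n_states)       # random start state
        for _ in range(50):               # truncated episode
            a = policy(x)
            x_next, cost = sample_transition(x, a, rng)
            delta = cost + alpha_discount * J[x_next] - J[x]   # temporal difference
            e[x] += 1.0
            for s in range(n_states):
                J[s] += step * delta * e[s]
                e[s] *= alpha_discount * lam
            x = x_next
    return J

# Hypothetical 3-state chain: the action is ignored, every step costs 1 except in state 2.
def sample_transition(x, a, rng):
    x_next = min(x + 1, 2) if rng.random() < 0.7 else max(x - 1, 0)
    return x_next, (0.1 if x == 2 else 1.0)

print(td_lambda(sample_transition, policy=lambda x: 0, n_states=3))
```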
77 Operator Bt is time-dependent since, for example, if we model an approximate version of optimistic policy iteration, then in each iteration the control policy changes and, therefore, the update operator changes, as well. [sent-429, score-0.692]
78 Suppose that Assumptions 1-2 hold, each $B_t$ is a contraction, and Π is linear and supremum-norm nonexpansive; then $\Phi(r_t)$ κ-approximates $V^*$ for any initial $r_0$ with $\kappa = 4\rho/(1 - \beta_0)$, where $\rho = \limsup_{t \to \infty} \|V_t^* - V^*\|$. [sent-444, score-0.252]
79 In our randomized grid worlds the agent was taken to a random surrounding state (no matter what action it chose) with probability η and this probability changed after each step. [sent-491, score-0.231]
80 Table 1 was generated using an (optimistic) SARSA algorithm: the current policy was evaluated by SARSA and then (optimistically) improved; more precisely, the greedy policy with respect to the achieved evaluation was calculated. [sent-497, score-0.636]
81 That policy was also soft, namely, it made random explorations with probability 0. [sent-498, score-0.212]
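A sketch of this experimental setup: SARSA evaluation with a soft (greedy-plus-random-exploration) policy in a randomized grid world where, with probability η, the agent is moved to a random neighbouring state regardless of its chosen action. The grid size, η, the exploration rate and the learning parameters are placeholders, not the values behind Table 1.

```python
import random

SIZE, GOAL = 5, (4, 4)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]        # the four axis-parallel moves

def step(state, action, eta, rng):
    """With probability eta the agent is dragged to a random neighbouring state."""
    move = rng.choice(ACTIONS) if rng.random() < eta else action
    x = min(max(state[0] + move[0], 0), SIZE - 1)
    y = min(max(state[1] + move[1], 0), SIZE - 1)
    nxt = (x, y)
    return nxt, (0.0 if nxt == GOAL else 1.0)        # unit cost per step until the goal

def soft_greedy(Q, state, explore, rng):
    if rng.random() < explore:                       # random exploration
        return rng.randrange(len(ACTIONS))
    return min(range(len(ACTIONS)), key=lambda a: Q[state][a])   # greedy w.r.t. costs

def sarsa(eta=0.1, explore=0.05, alpha=0.95, step_size=0.1, episodes=2000, seed=0):
    rng = random.Random(seed)
    Q = {(i, j): [0.0] * len(ACTIONS) for i in range(SIZE) for j in range(SIZE)}
    for _ in range(episodes):
        s = (0, 0)
        a = soft_greedy(Q, s, explore, rng)
        while s != GOAL:
            s2, cost = step(s, ACTIONS[a], eta, rng)
            a2 = soft_greedy(Q, s2, explore, rng)
            # SARSA update on the state-action-cost-state-action quintuple
            Q[s][a] += step_size * (cost + alpha * Q[s2][a2] - Q[s][a])
            s, a = s2, a2
    return Q

Q = sarsa()
print(min(Q[(0, 0)]))   # estimated cost-to-go from the start corner
```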
82 An explanation could be that large changes in the transitions cause the agent to lose control over the events, since it becomes very hard to predict the effects of the actions and, hence, to estimate the expected costs. [sent-654, score-0.246]
83 In the paper we have argued that the optimal value function of a (discounted) MDP Lipschitz continuously depends on the transition-probability function and the immediate-cost function; therefore, small changes in the environment result only in small changes in the optimal value function. [sent-659, score-0.238]
84 A bound for changes in the discount factor was also proven, and it was demonstrated that, in general, this dependence was not Lipschitz continuous. [sent-661, score-0.264]
85 Additionally, it was shown that changes in the discount rate could be traced back to changes in the immediate-cost function. [sent-662, score-0.371]
86 Later, the effects of environmental changes on Q-learning based flexible job-shop scheduling was experimentally studied. [sent-671, score-0.219]
87 These results were illustrated through three classical RL methods: asynchronous value iteration, Q-learning and temporal difference learning policy evaluation. [sent-676, score-0.303]
88 First, analyzing the effects of environmental changes on the value function in case of the expected average cost optimization criterion would be interesting. [sent-680, score-0.197]
89 Theorem 12 Assume that two discounted MDPs differ only in the immediate-cost functions, denoted by $g_1$ and $g_2$. [sent-704, score-0.247]
90 Theorem 13 Assume that two discounted MDPs differ only in the discount factors, denoted by $\alpha_1, \alpha_2 \in [0, 1)$. [sent-712, score-0.434]
91 Proof Let $\pi_i^*$ denote a greedy and deterministic policy based on value function $J_i^*$, where $i \in \{1, 2\}$. [sent-714, score-0.212]
92 Naturally, policy $\pi_i^*$ is optimal if the discount rate is $\alpha_i$ (Theorem 6). [sent-715, score-0.399]
93 Then, let us introduce a deterministic Markovian control policy $\hat{\pi}$ defined as $\hat{\pi}(x) = \pi_1^*(x)$ if $J_1^*(x) \le J_2^*(x)$, and $\hat{\pi}(x) = \pi_2^*(x)$ if $J_2^*(x) < J_1^*(x)$. [sent-716, score-0.284]
94 Lemma 14 Assume that we have two discounted MDPs which differ only in the transition-probability functions or only in the immediate-cost functions or only in the discount factors. [sent-720, score-0.469]
95 Case 3: Assume that the MDPs differ only in the discount rates, denoted by α 1 and α2 . [sent-731, score-0.275]
96 Lemma 18 Assume that two discounted MDPs, M1 and M2 , differ only in the discount factors, denoted by α1 and α2 . [sent-735, score-0.434]
97 Then, there exists an MDP, denoted by M3 , such that it differs only in the immediate-cost function from M1 , thus its discount factor is α1 , and it has the same optimal value function as M2 . [sent-736, score-0.224]
98 During the proof we will apply the solutions of suitable finite-horizon problems; thus, in order to avoid notational confusion, let us denote the optimal state and action value functions of $M_2$ and $M_3$ by $J^*$, $Q^*$ and $\hat{J}^*$, $\hat{Q}^*$, respectively. [sent-751, score-0.219]
99 Assumption 3 For all t, operator $K_t : V \to V$ is a supremum norm contraction mapping with Lipschitz constant $\beta_t < 1$ and with fixed point $V_t^*$. [sent-764, score-0.279]
100 Estimates for perturbations of general discounted Markov control chains. [sent-819, score-0.231]
wordName wordTfidf (topN-words)
[('vt', 0.396), ('mdp', 0.318), ('mdps', 0.249), ('kt', 0.245), ('policy', 0.212), ('discount', 0.187), ('discounted', 0.159), ('aji', 0.149), ('einforcement', 0.149), ('hanging', 0.149), ('nvironments', 0.149), ('onostori', 0.149), ('mins', 0.126), ('reinforcement', 0.124), ('ht', 0.124), ('jt', 0.113), ('lipschitz', 0.109), ('gt', 0.092), ('contraction', 0.09), ('qt', 0.09), ('tsitsiklis', 0.09), ('transition', 0.088), ('bertsekas', 0.087), ('sarsa', 0.084), ('environment', 0.084), ('bellman', 0.083), ('operator', 0.083), ('ft', 0.081), ('wt', 0.081), ('changes', 0.077), ('scheduling', 0.072), ('control', 0.072), ('environments', 0.07), ('environmental', 0.07), ('stepsize', 0.069), ('szita', 0.069), ('action', 0.069), ('supremum', 0.067), ('agent', 0.066), ('theorem', 0.066), ('stochastic', 0.063), ('jn', 0.062), ('rl', 0.062), ('horizon', 0.06), ('feinberg', 0.059), ('kalm', 0.059), ('szepesv', 0.059), ('lim', 0.056), ('earning', 0.056), ('adp', 0.055), ('supt', 0.055), ('state', 0.054), ('relaxed', 0.052), ('differ', 0.051), ('asynchronous', 0.05), ('monostori', 0.05), ('cost', 0.05), ('markovian', 0.048), ('sutton', 0.046), ('dt', 0.046), ('td', 0.045), ('shwartz', 0.045), ('tt', 0.045), ('changing', 0.043), ('xt', 0.043), ('theorems', 0.042), ('grid', 0.042), ('convergence', 0.041), ('temporal', 0.041), ('pathological', 0.04), ('xtk', 0.04), ('norm', 0.039), ('resource', 0.039), ('ji', 0.039), ('job', 0.039), ('costs', 0.038), ('iterative', 0.038), ('denoted', 0.037), ('iteration', 0.036), ('functions', 0.036), ('pt', 0.036), ('bt', 0.035), ('farther', 0.035), ('rt', 0.034), ('barto', 0.034), ('sgn', 0.034), ('stepsizes', 0.034), ('littman', 0.034), ('je', 0.034), ('xm', 0.032), ('hull', 0.032), ('arg', 0.032), ('actions', 0.031), ('sequence', 0.031), ('traced', 0.03), ('bal', 0.03), ('csan', 0.03), ('obsolete', 0.03), ('ordinary', 0.029), ('ki', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.000001 95 jmlr-2008-Value Function Based Reinforcement Learning in Changing Markovian Environments
Author: Balázs Csanád Csáji, László Monostori
Abstract: The paper investigates the possibility of applying value function based reinforcement learning (RL) methods in cases when the environment may change over time. First, theorems are presented which show that the optimal value function of a discounted Markov decision process (MDP) Lipschitz continuously depends on the immediate-cost function and the transition-probability function. Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of (ε, δ)-MDPs is introduced, which is a generalization of MDPs and ε-MDPs. In this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must be bounded in the limit. Then, learning algorithms in changing environments are analyzed. A general relaxed convergence theorem for stochastic iterative algorithms is presented. We also demonstrate the results through three classical RL methods: asynchronous value iteration, Q-learning and temporal difference learning. Finally, some numerical experiments concerning changing environments are presented. Keywords: Markov decision processes, reinforcement learning, changing environments, (ε, δ)MDPs, value function bounds, stochastic iterative algorithms
2 0.23810366 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
Author: Rémi Munos, Csaba Szepesvári
Abstract: In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrary good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted L p -norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is “aligned” with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings. Keywords: fitted value iteration, discounted Markovian decision processes, generative model, reinforcement learning, supervised learning, regression, Pollard’s inequality, statistical learning theory, optimal control
3 0.10349227 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
Author: Yonatan Amit, Shai Shalev-Shwartz, Yoram Singer
Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online predictor by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution of the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with synthetic data and with multiclass text categorization tasks. Keywords: online learning, parallel computation, mistake bounds, structured prediction
Author: Avraham Bab, Ronen I. Brafman
Abstract: Multi Agent Reinforcement Learning (MARL) has received continually growing attention in the past decade. Many algorithms that vary in their approaches to the different subtasks of MARL have been developed. However, the theoretical convergence results for these algorithms do not give a clue as to their practical performance nor supply insights to the dynamics of the learning process itself. This work is a comprehensive empirical study conducted on MGS, a simulation system developed for this purpose. It surveys the important algorithms in the field, demonstrates the strengths and weaknesses of the different approaches to MARL through application of FriendQ, OAL, WoLF, FoeQ, Rmax, and other algorithms to a variety of fully cooperative and fully competitive domains in self and heterogeneous play, and supplies an informal analysis of the resulting learning processes. The results can aid in the design of new learning algorithms, in matching existing algorithms to specific tasks, and may guide further research and formal analysis of the learning processes. Keywords: reinforcement learning, multi-agent reinforcement learning, stochastic games
5 0.089740746 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning
Author: Sungwook Yoon, Alan Fern, Robert Givan
Abstract: A number of today’s state-of-the-art planners are based on forward state-space search. The impressive performance can be attributed to progress in computing domain independent heuristics that perform well across many domains. However, it is easy to find domains where such heuristics provide poor guidance, leading to planning failure. Motivated by such failures, the focus of this paper is to investigate mechanisms for learning domain-specific knowledge to better control forward search in a given domain. While there has been a large body of work on inductive learning of control knowledge for AI planning, there is a void of work aimed at forward-state-space search. One reason for this may be that it is challenging to specify a knowledge representation for compactly representing important concepts across a wide range of domains. One of the main contributions of this work is to introduce a novel feature space for representing such control knowledge. The key idea is to define features in terms of information computed via relaxed plan extraction, which has been a major source of success for non-learning planners. This gives a new way of leveraging relaxed planning techniques in the context of learning. Using this feature space, we describe three forms of control knowledge—reactive policies (decision list rules and measures of progress) and linear heuristics—and show how to learn them and incorporate them into forward state-space search. Our empirical results show that our approaches are able to surpass state-of-the-art nonlearning planners across a wide range of planning competition domains. Keywords: planning, machine learning, knowledge representation, search
6 0.073963262 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses
7 0.072738096 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management
8 0.069995798 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
9 0.065989934 90 jmlr-2008-Theoretical Advantages of Lenient Learners: An Evolutionary Game Theoretic Perspective
10 0.064408757 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
11 0.050205626 80 jmlr-2008-Ranking Individuals by Group Comparisons
12 0.049478874 52 jmlr-2008-Learning from Multiple Sources
13 0.046913434 37 jmlr-2008-Forecasting Web Page Views: Methods and Observations
14 0.042221934 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
15 0.036909446 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
16 0.036167145 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
17 0.035967678 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
18 0.035445694 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
19 0.035008579 53 jmlr-2008-Learning to Combine Motor Primitives Via Greedy Additive Regression
20 0.032727126 92 jmlr-2008-Universal Multi-Task Kernels
topicId topicWeight
[(0, 0.194), (1, -0.059), (2, -0.09), (3, -0.075), (4, -0.292), (5, -0.396), (6, -0.191), (7, -0.06), (8, 0.01), (9, 0.005), (10, 0.007), (11, -0.136), (12, 0.049), (13, -0.023), (14, -0.159), (15, 0.013), (16, -0.147), (17, -0.102), (18, 0.041), (19, -0.044), (20, -0.014), (21, -0.174), (22, 0.066), (23, -0.024), (24, -0.1), (25, 0.093), (26, 0.009), (27, 0.012), (28, -0.18), (29, 0.029), (30, 0.026), (31, 0.061), (32, -0.039), (33, 0.059), (34, 0.088), (35, 0.046), (36, -0.103), (37, -0.002), (38, -0.05), (39, -0.032), (40, 0.006), (41, 0.079), (42, 0.039), (43, -0.008), (44, 0.015), (45, -0.063), (46, 0.117), (47, -0.079), (48, 0.052), (49, -0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.96193469 95 jmlr-2008-Value Function Based Reinforcement Learning in Changing Markovian Environments
Author: Balázs Csanád Csáji, László Monostori
Abstract: The paper investigates the possibility of applying value function based reinforcement learning (RL) methods in cases when the environment may change over time. First, theorems are presented which show that the optimal value function of a discounted Markov decision process (MDP) Lipschitz continuously depends on the immediate-cost function and the transition-probability function. Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of (ε, δ)-MDPs is introduced, which is a generalization of MDPs and ε-MDPs. In this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must be bounded in the limit. Then, learning algorithms in changing environments are analyzed. A general relaxed convergence theorem for stochastic iterative algorithms is presented. We also demonstrate the results through three classical RL methods: asynchronous value iteration, Q-learning and temporal difference learning. Finally, some numerical experiments concerning changing environments are presented. Keywords: Markov decision processes, reinforcement learning, changing environments, (ε, δ)MDPs, value function bounds, stochastic iterative algorithms
2 0.76066577 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
Author: Rémi Munos, Csaba Szepesvári
Abstract: In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrary good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted L p -norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is “aligned” with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings. Keywords: fitted value iteration, discounted Markovian decision processes, generative model, reinforcement learning, supervised learning, regression, Pollard’s inequality, statistical learning theory, optimal control
3 0.44828382 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
Author: Yonatan Amit, Shai Shalev-Shwartz, Yoram Singer
Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online predictor by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution of the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with synthetic data and with multiclass text categorization tasks. Keywords: online learning, parallel computation, mistake bounds, structured prediction
4 0.39082047 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning
Author: Sungwook Yoon, Alan Fern, Robert Givan
Abstract: A number of today’s state-of-the-art planners are based on forward state-space search. The impressive performance can be attributed to progress in computing domain independent heuristics that perform well across many domains. However, it is easy to find domains where such heuristics provide poor guidance, leading to planning failure. Motivated by such failures, the focus of this paper is to investigate mechanisms for learning domain-specific knowledge to better control forward search in a given domain. While there has been a large body of work on inductive learning of control knowledge for AI planning, there is a void of work aimed at forward-state-space search. One reason for this may be that it is challenging to specify a knowledge representation for compactly representing important concepts across a wide range of domains. One of the main contributions of this work is to introduce a novel feature space for representing such control knowledge. The key idea is to define features in terms of information computed via relaxed plan extraction, which has been a major source of success for non-learning planners. This gives a new way of leveraging relaxed planning techniques in the context of learning. Using this feature space, we describe three forms of control knowledge—reactive policies (decision list rules and measures of progress) and linear heuristics—and show how to learn them and incorporate them into forward state-space search. Our empirical results show that our approaches are able to surpass state-of-the-art nonlearning planners across a wide range of planning competition domains. Keywords: planning, machine learning, knowledge representation, search
Author: Avraham Bab, Ronen I. Brafman
Abstract: Multi Agent Reinforcement Learning (MARL) has received continually growing attention in the past decade. Many algorithms that vary in their approaches to the different subtasks of MARL have been developed. However, the theoretical convergence results for these algorithms do not give a clue as to their practical performance nor supply insights to the dynamics of the learning process itself. This work is a comprehensive empirical study conducted on MGS, a simulation system developed for this purpose. It surveys the important algorithms in the field, demonstrates the strengths and weaknesses of the different approaches to MARL through application of FriendQ, OAL, WoLF, FoeQ, Rmax, and other algorithms to a variety of fully cooperative and fully competitive domains in self and heterogeneous play, and supplies an informal analysis of the resulting learning processes. The results can aid in the design of new learning algorithms, in matching existing algorithms to specific tasks, and may guide further research and formal analysis of the learning processes. Keywords: reinforcement learning, multi-agent reinforcement learning, stochastic games
6 0.30927131 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
7 0.29705629 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
8 0.28999186 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses
9 0.28944981 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management
10 0.20801939 37 jmlr-2008-Forecasting Web Page Views: Methods and Observations
11 0.20773335 80 jmlr-2008-Ranking Individuals by Group Comparisons
12 0.16991095 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
13 0.16680761 52 jmlr-2008-Learning from Multiple Sources
14 0.15962668 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
15 0.15172829 53 jmlr-2008-Learning to Combine Motor Primitives Via Greedy Additive Regression
16 0.14832622 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
17 0.14341339 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
18 0.13751444 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
19 0.13745639 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
20 0.13169909 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
topicId topicWeight
[(0, 0.023), (40, 0.017), (54, 0.618), (58, 0.028), (66, 0.032), (88, 0.075), (92, 0.037), (94, 0.041), (99, 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.960329 95 jmlr-2008-Value Function Based Reinforcement Learning in Changing Markovian Environments
Author: Balázs Csanád Csáji, László Monostori
Abstract: The paper investigates the possibility of applying value function based reinforcement learning (RL) methods in cases when the environment may change over time. First, theorems are presented which show that the optimal value function of a discounted Markov decision process (MDP) Lipschitz continuously depends on the immediate-cost function and the transition-probability function. Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of (ε, δ)-MDPs is introduced, which is a generalization of MDPs and ε-MDPs. In this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must be bounded in the limit. Then, learning algorithms in changing environments are analyzed. A general relaxed convergence theorem for stochastic iterative algorithms is presented. We also demonstrate the results through three classical RL methods: asynchronous value iteration, Q-learning and temporal difference learning. Finally, some numerical experiments concerning changing environments are presented. Keywords: Markov decision processes, reinforcement learning, changing environments, (ε, δ)MDPs, value function bounds, stochastic iterative algorithms
2 0.89634627 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
Author: Vojtěch Franc, Bogdan Savchynskyy
Abstract: The max-sum classifier predicts n-tuple of labels from n-tuple of observable variables by maximizing a sum of quality functions defined over neighbouring pairs of labels and observable variables. Predicting labels as MAP assignments of a Random Markov Field is a particular example of the max-sum classifier. Learning parameters of the max-sum classifier is a challenging problem because even computing the response of such classifier is NP-complete in general. Estimating parameters using the Maximum Likelihood approach is feasible only for a subclass of max-sum classifiers with an acyclic structure of neighbouring pairs. Recently, the discriminative methods represented by the perceptron and the Support Vector Machines, originally designed for binary linear classifiers, have been extended for learning some subclasses of the max-sum classifier. Besides the max-sum classifiers with the acyclic neighbouring structure, it has been shown that the discriminative learning is possible even with arbitrary neighbouring structure provided the quality functions fulfill some additional constraints. In this article, we extend the discriminative approach to other three classes of max-sum classifiers with an arbitrary neighbourhood structure. We derive learning algorithms for two subclasses of max-sum classifiers whose response can be computed in polynomial time: (i) the max-sum classifiers with supermodular quality functions and (ii) the max-sum classifiers whose response can be computed exactly by a linear programming relaxation. Moreover, we show that the learning problem can be approximately solved even for a general max-sum classifier. Keywords: max-xum classifier, hidden Markov networks, support vector machines
3 0.83148581 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting
Author: David Mease, Abraham Wyner
Abstract: The statistical perspective on boosting algorithms focuses on optimization, drawing parallels with maximum likelihood estimation for logistic regression. In this paper we present empirical evidence that raises questions about this view. Although the statistical perspective provides a theoretical framework within which it is possible to derive theorems and create new algorithms in general contexts, we show that there remain many unanswered important questions. Furthermore, we provide examples that reveal crucial flaws in the many practical suggestions and new methods that are derived from the statistical view. We perform carefully designed experiments using simple simulation models to illustrate some of these flaws and their practical consequences. Keywords: boosting algorithms, LogitBoost, AdaBoost
4 0.61197233 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
Author: Rémi Munos, Csaba Szepesvári
Abstract: In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrary good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted L p -norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is “aligned” with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings. Keywords: fitted value iteration, discounted Markovian decision processes, generative model, reinforcement learning, supervised learning, regression, Pollard’s inequality, statistical learning theory, optimal control
5 0.50450802 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
Author: Yonatan Amit, Shai Shalev-Shwartz, Yoram Singer
Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online predictor by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution of the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with synthetic data and with multiclass text categorization tasks. Keywords: online learning, parallel computation, mistake bounds, structured prediction
6 0.49389341 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
7 0.45494014 19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression
8 0.44263968 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management
10 0.41225359 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
11 0.41113004 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
12 0.41095775 83 jmlr-2008-Robust Submodular Observation Selection
13 0.40433496 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
14 0.40175211 56 jmlr-2008-Magic Moments for Structured Output Prediction
15 0.39810577 86 jmlr-2008-SimpleMKL
16 0.39015377 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
18 0.38135317 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
19 0.37877417 28 jmlr-2008-Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines
20 0.37523428 9 jmlr-2008-Active Learning by Spherical Subdivision