jmlr jmlr2006 jmlr2006-19 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Anders Jonsson, Andrew Barto
Abstract: We present Variable Influence Structure Analysis, or VISA, an algorithm that performs hierarchical decomposition of factored Markov decision processes. VISA uses a dynamic Bayesian network model of actions, and constructs a causal graph that captures relationships between state variables. In tasks with sparse causal graphs VISA exploits structure by introducing activities that cause the values of state variables to change. The result is a hierarchy of activities that together represent a solution to the original task. VISA performs state abstraction for each activity by ignoring irrelevant state variables and lower-level activities. In addition, we describe an algorithm for constructing compact models of the activities introduced. State abstraction and compact activity models enable VISA to apply efficient algorithms to solve the stand-alone subtask associated with each activity. Experimental results show that the decomposition introduced by VISA can significantly accelerate construction of an optimal, or near-optimal, policy. Keywords: Markov decision processes, hierarchical decomposition, state abstraction
Reference: text
sentIndex sentText sentNum sentScore
1 Ravindran (2004) showed that an option o is associated with a stand-alone task given by the option SMDP Mo = S, Oo , Ψo , Po , Ro , where Oo is a set of lower-level options. [sent-154, score-0.795]
2 The policy π of option o can be defined as any optimal policy of the option SMDP Mo . [sent-158, score-0.936]
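To make the pieces of the option SMDP concrete, here is a minimal illustrative container for Mo = S, Oo, Ψo, Po, Ro; the field names mirror the tuple above, while the concrete Python types are assumptions made only for this sketch, not the paper's data structures.

```python
# Illustrative container for an option SMDP Mo = <S, Oo, Psi_o, Po, Ro>.
# The field types are assumptions for this sketch, not the paper's representation.
from dataclasses import dataclass

@dataclass
class OptionSMDP:
    states: list       # S: the (possibly abstracted) states of the subtask
    options: list      # Oo: lower-level options/actions the policy selects among
    admissible: dict   # Psi_o: state -> subset of Oo admissible in that state
    transition: dict   # Po: (state, option) -> distribution over next states
    reward: dict       # Ro: (state, option) -> expected (discounted) reward
```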
3 For each exit, VISA introduces an exit option, that is, an activity that terminates when the precondition of the exit is met and then executes the exit action. [sent-191, score-1.084]
4 In other words, the purpose of an exit option is to change the value of a state variable by first satisfying the necessary precondition and then executing the appropriate action. [sent-192, score-0.905]
5 To determine the policy of each exit option, VISA constructs the components of an option SMDP and defines the policy as a solution to the resulting option SMDP. [sent-193, score-1.298]
6 To simplify learning the policy of an exit option, VISA performs state abstraction for the option SMDP. [sent-194, score-0.992]
7 From the causal graph (Figure 2 shows the causal graph of the coffee task) it is easy to identify a set of state variables that are irrelevant, that is, do not influence state variables whose values appear in the precondition. [sent-195, score-0.732]
8 In addition, the option SMDP only needs to include actions and options that change the values of state variables that appear in the precondition. [sent-197, score-0.677]
9 The resulting state abstraction significantly reduces the complexity of learning the exit option policies. [sent-198, score-0.882]
10 The causal graph implicitly defines a hierarchy of options in which an exit option that changes the value of a state variable in a strongly connected component selects between options that change the values of state variables in strongly connected components that have edges into that component. [sent-199, score-1.568]
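As an illustration of how the causal graph and its strongly connected components might be computed, the sketch below builds the graph from DBN parent sets and condenses it into the acyclic component graph; the input format (a dict of per-action parent sets) and the use of networkx are assumptions of this sketch.

```python
# Sketch: build the causal graph from DBN action models and condense it into
# the acyclic component graph whose topology induces the option hierarchy.
import networkx as nx

def causal_graph(dbn_parents):
    """dbn_parents[action][var] = variables that var's conditional
    probability distribution depends on under that action (assumed format)."""
    g = nx.DiGraph()
    for parents in dbn_parents.values():
        for var, deps in parents.items():
            g.add_node(var)
            for dep in deps:
                if dep != var:          # edge Si -> Sj if Si influences Sj
                    g.add_edge(dep, var)
    return g

def component_graph(g):
    """DAG whose nodes are the strongly connected components of g."""
    return nx.condensation(g)
```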
11 Just as for the other variables, there is an edge in the causal graph between a state variable Si and the expected reward node if and only if there is an action a ∈ A such that the expected reward as a result of executing a depends on the value of Si. [sent-222, score-0.679]
12 Note that the value of SW does not appear in the exit since that is the state variable whose value the exit changes. [sent-250, score-0.767]
13 These are almost identical: their associated exit options both terminate in states that assign the value L to state variable SL and execute action DC following successful termination. [sent-260, score-0.71]
14 Recall that C → C is the exit option associated with the exit (SL = L), DC , causing the value of SC to change from C to C. [sent-261, score-1.048]
15 The effect of the exit (SL = L, SC = C), DC is equivalent to the effect of a transformed exit (SC = C),C → C , that is, reach a state that assigns C to SC and execute option C → C following termination. [sent-262, score-1.125]
16 The benefit of this transformation is that the exit option H → H associated with the exit (SL = L, SC = C), DC no longer has to care about the value of SL, effectively removing an edge in the component graph of the task. [sent-263, score-1.108]
17 4 Introducing Exit Options For each exit c, a with a non-empty context c, VISA introduces an option o = I, π, β . [sent-266, score-0.692]
18 We refer to an option introduced by VISA as an exit option. (Figure 4: the transition graph (left) and reachability tree (right) of the component SU.) [sent-268, score-0.841]
19 Unlike regular options, an exit option associated with an exit c, a executes action a following termination. [sent-269, score-1.137]
20 Executing action GO in any state causes the location of the robot to change, so the exit option associated with this exit is equivalent to the primitive action GO. [sent-272, score-1.442]
21 In the coffee task example, we adopt the convention of referring to an exit option using the change that it causes, since this is an unambiguous and simple notation. [sent-274, score-0.912]
22 For example, option W → W is the exit option associated with the exit (SU = U, SR = R), GO that causes the value of SW to change from W to W with non-zero probability. [sent-275, score-1.406]
23 In general, several exits may cause the same change in the value of a variable, and VISA would introduce an exit option for each of these exits, so this notation is not always unambiguous. [sent-276, score-0.803]
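A minimal sketch of the exit and exit-option bookkeeping described above: an exit is a context/action pair, and one exit option is introduced per exit with a non-empty context. The class and field names are illustrative assumptions, not taken from the paper.

```python
# Sketch of exits and exit options. An exit pairs a context with an action;
# VISA introduces one exit option per exit with a non-empty context.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Exit:
    context: tuple     # e.g. (("SU", "U"), ("SR", "R")): variable/value pairs
    action: str        # primitive action executed once the context holds

@dataclass
class ExitOption:
    exit: Exit
    initiation_set: frozenset = frozenset()     # I: states where the option is admissible
    policy: dict = field(default_factory=dict)  # pi: state -> lower-level option

def introduce_exit_options(exits):
    # only exits with a non-empty context give rise to exit options
    return [ExitOption(e) for e in exits if e.context]
```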
24 1 Initiation Set The initiation set I of exit option o determines when o is admissible, that is, the subset of states in which it is possible to execute o. [sent-279, score-0.787]
25 In our example, option W → W should only be admissible in states that assign W to SW , since otherwise the option cannot cause the value of SW to change from W to W . [sent-285, score-0.811]
26 The robot can acquire an umbrella by executing the exit option U → U, so there is a corresponding edge in the transition graph between the leaf associated with states that assign U to SU and the leaf associated with states that assign U to SU . [sent-291, score-1.088]
27 Since an exit represents termination of an exit option, here we are interested in computing backward reachability: from which states is it possible to reach the exit? [sent-304, score-0.736]
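The initiation set computation can be pictured as a backward reachability pass: start from the states that satisfy the exit's context and walk the transition relation backwards. The flat `transitions` dictionary below is an illustrative stand-in for the paper's tree-structured, per-component transition graphs.

```python
# Sketch: backward reachability for the initiation set. Collect every state
# from which some state satisfying the exit's context can be reached.
from collections import deque

def initiation_set(transitions, context_states):
    # invert the one-step transition relation
    preds = {}
    for s, successors in transitions.items():
        for t in successors:
            preds.setdefault(t, set()).add(s)
    reachable = set(context_states)
    frontier = deque(context_states)
    while frontier:
        t = frontier.popleft()
        for s in preds.get(t, ()):
            if s not in reachable:
                reachable.add(s)
                frontier.append(s)
    return reachable
```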
28 2 Termination Condition An exit option terminates as soon as it reaches the context c of its associated exit c, a, or as soon as it can no longer reach c. [sent-310, score-1.071]
29 Even though an exit option executes action a following termination, we can still represent termination of the option using the standard termination condition function β. [sent-311, score-1.205]
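A sketch of the termination condition in this form might look as follows; the state representation (a dict of variable values), the `exit` object, and the `in_initiation_set` predicate are assumed inputs of the sketch.

```python
# Sketch of the termination condition beta: terminate when the exit's context
# holds (the exit action is then executed), or when the context can no longer
# be reached, i.e. the state has left the initiation set.
def beta(state, exit, in_initiation_set):
    if all(state[var] == value for var, value in exit.context):
        return 1.0      # context reached: terminate and execute the exit action
    if not in_initiation_set(state):
        return 1.0      # context no longer reachable: terminate
    return 0.0          # otherwise keep following the option policy
```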
30 3 Policy VISA cannot directly define the policy of an exit option since it does not know the best strategy for reaching the associated context c. [sent-317, score-0.824]
31 Instead, the algorithm constructs an option SMDP Mo = So , Oo , Ψo , Po , Ro for option o that implicitly defines its policy π. [sent-318, score-0.854]
32 For example, consider the exit option W → W and its associated context (SU = U, SR = R). [sent-322, score-0.714]
33 In other words, the option set Oo of W → W only needs to include the exit option U → U. [sent-325, score-1.05]
34 If there are lower-level options that cause the process to leave the initiation set of an option in Oo , VISA includes these options in Oo as well. [sent-327, score-0.704]
35 For example, the exit option U → U causes the process to leave the initiation set of the exit option W → W . [sent-328, score-1.444]
36 If the robot does not have an umbrella and it is raining, the exit option W → W will no longer be admissible after executing the exit option U → U, which causes the robot to hold an umbrella. [sent-329, score-1.663]
37 In other words, an option whose option set Oo includes the exit option W → W should include the exit option U → U as well. [sent-330, score-2.1]
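Concretely, filling the option set might be sketched as below: seed it with options that change a context variable, then close it under options that can cause the process to leave the initiation set of a member. The two predicate arguments are assumptions of the sketch, standing in for tests derived from the DBN and option models.

```python
# Sketch of building the option set Oo for one exit option.
def build_option_set(candidates, context_vars,
                     changes_variable, can_leave_initiation_set):
    # seed: options (or actions) that change a variable in the exit's context
    oo = {o for o in candidates
          if any(changes_variable(o, v) for v in context_vars)}
    changed = True
    while changed:
        changed = False
        for o in candidates:
            # close under "can make the process leave the initiation set of a member"
            if o not in oo and any(can_leave_initiation_set(o, m) for m in oo):
                oo.add(o)
                changed = True
    return oo
```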
38 5 State Abstraction VISA simplifies learning in the option SMDPs by performing state abstraction separately for each exit option. [sent-343, score-0.882]
39 For example, in the case of exit option W → W , Z = {SU , SR } and Y = {SL , SU , SR }, since there is a directed path from SL to SU in the causal graph of the coffee task. [sent-348, score-0.991]
40 Recall that the goal of an exit option o is to reach the associated context c. [sent-349, score-0.714]
41 The initiation set of option o is determined by the state variables in Z ⊆ Y, so whether or not the process leaves the initiation set depends only on those state variables. [sent-366, score-0.697]
42 In other words, the option SMDP of option o ignores all state variables not in strongly connected components for which the value of at least one state variable appears in the context c associated with option o. [sent-369, score-1.423]
43 In addition, it follows from Theorem 2 that the partition ΛZ induces a reduced SMDP that preserves optimality if and only if for each exit option o′ ∈ Oo, state variables in Y − Z do not influence the state variables in Z as a result of executing o′. [sent-378, score-1.065]
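A small sketch of this abstraction step: Z is the union of the strongly connected components that contain a context variable, and the abstraction function projects a state onto Z, which induces the partition ΛZ. The `scc_of` mapping (variable to its component) is an assumed input.

```python
# Sketch of the state abstraction for one exit option.
def abstraction_variables(context_vars, scc_of):
    """Z: variables of every strongly connected component that contains a
    variable appearing in the exit's context."""
    z = set()
    for v in context_vars:
        z |= set(scc_of[v])
    return z

def project(state, z_vars):
    """Abstraction function f_Z: keep only the values of the variables in Z."""
    return tuple(sorted((v, state[v]) for v in z_vars))
```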
44 For example, consider the exit option H → H in the coffee task. [sent-381, score-0.855]
45 However, as a result of executing the exit option C → C, the resulting value of SC does not depend on the previous value of SL. [sent-385, score-0.783]
46 The same is true for exit option C → C, so it follows from Theorem 2 that the partition ΛZ induced by fZ has the stochastic substitution property. [sent-387, score-0.759]
47 The policy tree structure of an option can be constructed by merging the transition graph trees of strongly connected components that contain state variables whose values appear in the associated context. [sent-398, score-0.816]
48 Another part of abstraction is reducing the number of options in the option set Oo of the option SMDP Mo. [sent-400, score-0.938]
49 If there are fewer options to select from, an autonomous agent can discover more quickly which option or options result in an optimal value for each block of the state partition. [sent-401, score-0.719]
50 The algorithm fills the option set Oo with options that change the values of state variables in those strongly connected components. [sent-403, score-0.717]
51 However, instead of being a policy that selects among primitive actions, the policy of the task option selects among the exit options introduced by VISA. [sent-410, score-1.133]
52 Thus, the policy of the task option represents a hierarchical solution to the task that takes advantage of the exit options to set the values of relevant state variables in such a way as to maximize expected reward. [sent-411, score-1.198]
53 To learn the task option policy, VISA constructs the option SMDP corresponding to the task option using the same strategy it uses for the exit options. [sent-415, score-1.55]
54 However, the expected reward function of the task option SMDP is equal to the expected reward function of the original MDP. [sent-416, score-0.679]
55 For determining the policy of the task option, the expected reward for executing an exit option o is defined as the sum of discounted reward of the primitive actions selected during the execution of o. [sent-417, score-1.337]
56 VISA also performs state abstraction for the task option SMDP in the same way it does for exit options. [sent-418, score-0.939]
57 In addition, the option set of the task option SMDP only includes exit options that change the values of state variables in Z. [sent-421, score-1.358]
58 When provided with the expected reward function of a specific task, VISA can construct the task option by overlaying the expected reward node onto the existing causal graph. [sent-426, score-0.797]
59 7 Option Hierarchy The task option, together with the exit options introduced by VISA, implicitly defines a hierarchy of options in which the options on one level select options on the next lower level. [sent-430, score-1.296]
60 Recall that for an option associated with a node in the component graph, the option SMDP only includes options that change state variables in components that have edges to that node. [sent-431, score-1.009]
61 Since the component graph is guaranteed to contain no cycles, the option hierarchy is well-defined, and it is not possible for an option to execute itself, either directly or indirectly. [sent-433, score-0.799]
62 At the top level, VISA constructs a task option that uses the exit options to approximate a solution to the original MDP. [sent-461, score-0.908]
63 Given access to the causal graph, VISA makes it possible to efficiently perform state abstraction for any option whose goal is to reach a context specified by an assignment of values to a subset of the state variables. [sent-470, score-0.743]
64 State abstraction for the task option is particularly efficient in tasks for which the expected reward depends on only a few state variables. [sent-473, score-0.763]
65 If most state variables influence reward, learning the task option policy requires almost the same effort as learning a policy over primitive actions. [sent-474, score-0.788]
66 For example, in the coffee task, the option hierarchy discovered by VISA enables optimality, since none of the exit options choose suboptimal actions. [sent-480, score-1.009]
67 Constructing Compact Option Models VISA computes option policies by first constructing the option SMDP, Mo, of each exit option o. [sent-580, score-1.433]
68 2 Multi-Time Models for Exit Options Recall that an exit option o is associated with an exit c, a , composed of a context c and an action a. [sent-611, score-1.137]
69 Unlike regular options, the exit option executes action a following termination in context c. [sent-612, score-0.814]
70 The multi-time model of an exit option o has the following form: Po = γ(B Pa + (E − B) ∑_{a′∈A} Πa′ Pa′ Po) (8), Ro = B Ra + (E − B) ∑_{a′∈A} Πa′ (Ra′ + γ Pa′ Ro) (9), where a is the action of the exit associated with o. [sent-615, score-1.137]
71 Because we modified the multi-time model to handle the case of exit options, we need to show that the discounted probabilities, Po , and expected reward, Ro , associated with an exit option o are well-defined under the condition that o is guaranteed to eventually terminate. [sent-617, score-1.07]
72 Definition 3 An option o is proper if for each state si ∈ I, o eventually terminates with probability 1 when executed in si . [sent-618, score-0.732]
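For a proper option the two linear systems above have unique solutions, so one way to picture the computation is to solve them directly in a flat matrix form: B diagonal with β(s) on the diagonal, E the identity, Πa′ diagonal with the policy probabilities, Pa′ the one-step transition matrices, and Ra′ the one-step expected reward vectors. These flat representations are assumptions of the sketch; the paper works with structured, tree-based models instead.

```python
# Sketch: solve Equations 8 and 9 directly for a proper exit option.
import numpy as np

def multi_time_model(B, Pi, P, R, exit_action, gamma):
    n = B.shape[0]
    E = np.eye(n)
    M = sum(Pi[a] @ P[a] for a in Pi)        # sum over a' of Pi_a' P_a'
    r = sum(Pi[a] @ R[a] for a in Pi)        # sum over a' of Pi_a' R_a'
    A = E - gamma * (E - B) @ M              # common system matrix
    Po = np.linalg.solve(A, gamma * B @ P[exit_action])        # Equation 8
    Ro = np.linalg.solve(A, B @ R[exit_action] + (E - B) @ r)  # Equation 9
    return Po, Ro
```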
73 The policy of an exit option is already in the form of a tree, and it is easy to construct a tree that represents the termination condition function β. [sent-629, score-0.857]
74 We define three more properties of partitions of S with respect to a factored MDP M and an option o = I, π, β : Definition 6 A partition Λ of S is policy respecting if for each pair of states (si, sj) ∈ S2 and each action a ∈ A, [si]Λ = [sj]Λ implies that π(si, a) = π(sj, a). [sent-673, score-0.784]
75 Take the example of computing the discounted probability P^o_W associated with state variable SW and exit option C → C in the coffee task. [sent-680, score-0.998]
76 Theorem 10 Let o be a proper option and let ΛR be a partition of S that has the stochastic substitution property, is reward respecting, policy respecting, and termination respecting. [sent-685, score-0.7]
77 Usually, all state variables indirectly influence reward, so normally it is not possible to ignore the values of any state variables while computing the expected reward model Ro associated with exit option o. [sent-688, score-1.086]
78 We can take advantage of distribution irrelevance to compute the multi-time model of an exit option when γ = 1. [sent-693, score-0.716]
79 Let o be the exit option associated with the exit c, a . [sent-694, score-1.048]
80 Instead, the conditional probabilities associated with state variable Sd ∈ C and option o are given by the conditional probabilities associated with Sd and the exit action a, restricted to states s ∈ S such that fC (s) = c. [sent-698, score-1.007]
81 For example, as a result of executing the exit option associated with the exit (SL = L), BC in the coffee task, the value of state variable SL is L immediately before executing BC. [sent-699, score-1.492]
82 As a result of executing the option that acquires coffee, the location of the robot is always the coffee shop, regardless of its previous location. [sent-701, score-0.696]
83 Let Uo ⊆ S denote the subset of state variables whose values do not change as a result of executing any action selected by the policy π of exit option o. [sent-703, score-1.102]
84 For the exit option o associated with exit (SL = L), BC in the coffee task, Uo = {SU , SR , SC , SH }, since the values of these state variables do not change as a result of executing GO, the only action selected by the policy of o. [sent-704, score-1.621]
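A tiny sketch of computing Uo: remove from the full variable set every variable that some action selected by the option's policy can change. The `changed_by` mapping (read off each action's DBN) is an assumed input.

```python
# Sketch: Uo is the set of state variables whose values no action selected by
# the option's policy can change.
def unchanged_variables(all_vars, policy_actions, changed_by):
    uo = set(all_vars)
    for a in policy_actions:
        uo -= set(changed_by[a])   # drop variables action a can change
    return uo
```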
85 6 Summary of the Algorithm In summary, to compute the multi-time model of an exit option one should first solve Equation 9 to compute the expected reward. [sent-707, score-0.692]
86 When we have computed the conditional probability tree associated with each state variable for an exit option, as well as a tree representing expected reward, we can construct a DBN for the option in the same way that we can for primitive actions. [sent-711, score-0.868]
87 Figure 11 shows the DBN for the exit option associated with the exit (SL = L), BC in the coffee task when γ = 1, taking advantage of distribution irrelevance. [sent-712, score-1.268]
88 Since the DBN model of an option is in the same form as the DBN models of primitive actions, we can treat the option as a single unit and apply any of the algorithms that take advantage of compact representations. [sent-714, score-0.782]
89 Once the policy of an option has been learned, we can construct its DBN model and use that model both to learn the policy of a higher-level option and later to construct a DBN model of the higher-level option. [sent-716, score-0.98]
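The bottom-up use of these models could be sketched as follows; `solve_smdp` and `build_model` are placeholders for the SMDP solver (for example SPUDD) and the model-construction procedure described above, and the hierarchy interface is an assumption of the sketch.

```python
# Sketch: solve each option SMDP from the leaves of the option hierarchy up,
# build the option's compact DBN/multi-time model, and reuse that model as if
# it were a primitive action one level higher.
def solve_hierarchy(hierarchy, primitive_models, solve_smdp, build_model):
    models = dict(primitive_models)        # name -> compact model
    policies = {}
    for option in hierarchy.bottom_up():   # exit options first, task option last
        lower = {name: models[name] for name in hierarchy.children(option)}
        policies[option] = solve_smdp(option, lower)
        models[option] = build_model(option, policies[option], lower)
    return policies
```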
90 First, we ran VISA on the coffee task and used Equations 8 and 9 to compute the multi-time model of each exit option introduced. [sent-721, score-0.912]
91 For each exit option, including the task option at the top level, we used SPUDD to compute an optimal policy. [sent-722, score-0.749]
92 VISA introduces an option for each exit and uses sophisticated techniques to construct the components of each option. [sent-795, score-0.714]
93 The result is a hierarchy of options in which the policy of an option selects among options at a lower level in the hierarchy. [sent-796, score-0.753]
94 If VISA had access to compact option models, it could use dynamic programming techniques to compute the option policies without interacting with the environment. [sent-811, score-0.774]
95 Recall that VISA performs state abstraction for an option SMDP by constructing the partition ΛZ , where Z ⊆ S is the set of state variables in strongly connected components whose variable values appear in the context of the associated exit. [sent-832, score-0.84]
96 The problem occurs when an option selected by the option policy changes the value of a state variable not in Z that indirectly influences state variables in Z. [sent-834, score-1.045]
97 Since an option takes variable time to execute, the option passes through many states during execution. [sent-843, score-0.751]
98 Since no such state exists for a proper option o, we conclude that all elements on the diagonal of M are larger than 0 for a proper option o. [sent-885, score-0.815]
99 It is possible to eliminate an element a_ij, j < i, by subtracting a_ij r_j from row r_i: r_i − a_ij r_j = (a_i1 − a_ij a_j1 · · · a_ij − a_ij · 1 · · · 1 − a_ij a_ji · · · a_in − a_ij a_jn). Lemma 12: 0 ≤ 1 − a_ij a_ji ≤ 1, and 1 − a_ij a_ji = 0 if and only if a_ij = a_ji = −1. [sent-916, score-0.884]
100 Proof of Theorem 10 Assume that for each block λ ∈ ΛR and each state sk ∈ λ, the expected reward R(sk , o) as a result of executing option o is equal, and let Ro denote that expected reward. [sent-1012, score-0.73]
wordName wordTfidf (topN-words)
[('visa', 0.541), ('option', 0.358), ('exit', 0.334), ('dbn', 0.286), ('coffee', 0.163), ('smdp', 0.143), ('reward', 0.132), ('options', 0.131), ('si', 0.126), ('policy', 0.11), ('activities', 0.11), ('state', 0.099), ('pd', 0.097), ('causal', 0.096), ('spudd', 0.095), ('ro', 0.092), ('abstraction', 0.091), ('executing', 0.091), ('action', 0.089), ('sl', 0.087), ('exits', 0.087), ('po', 0.085), ('sw', 0.073), ('onsson', 0.072), ('raph', 0.072), ('sc', 0.071), ('vd', 0.071), ('sr', 0.069), ('barto', 0.069), ('su', 0.068), ('actions', 0.068), ('respecting', 0.067), ('ji', 0.066), ('ausal', 0.061), ('ecomposition', 0.061), ('factored', 0.061), ('initiation', 0.06), ('robot', 0.059), ('connected', 0.059), ('oo', 0.058), ('transition', 0.058), ('ai', 0.057), ('task', 0.057), ('sd', 0.054), ('srtdp', 0.051), ('reinforcement', 0.051), ('sk', 0.05), ('strongly', 0.049), ('partition', 0.042), ('agv', 0.041), ('subtask', 0.041), ('fy', 0.04), ('graph', 0.04), ('mdp', 0.039), ('activity', 0.036), ('admissible', 0.036), ('states', 0.035), ('umbrella', 0.034), ('termination', 0.033), ('primitive', 0.033), ('compact', 0.033), ('dean', 0.032), ('hierarchical', 0.031), ('aik', 0.031), ('sh', 0.031), ('reachability', 0.031), ('taxi', 0.031), ('mdps', 0.029), ('constructs', 0.028), ('hengst', 0.027), ('jonsson', 0.027), ('planning', 0.027), ('elimination', 0.027), ('tasks', 0.026), ('location', 0.025), ('fs', 0.025), ('ri', 0.025), ('substitution', 0.025), ('policies', 0.025), ('factory', 0.024), ('irrelevance', 0.024), ('cause', 0.024), ('probabilities', 0.024), ('uence', 0.024), ('hoey', 0.023), ('precondition', 0.023), ('hierarchy', 0.023), ('terminates', 0.023), ('bc', 0.023), ('partitions', 0.022), ('associated', 0.022), ('construct', 0.022), ('discounted', 0.022), ('variables', 0.021), ('go', 0.021), ('simsek', 0.02), ('dc', 0.02), ('component', 0.02), ('mii', 0.02), ('mo', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 19 jmlr-2006-Causal Graph Based Decomposition of Factored MDPs
Author: Anders Jonsson, Andrew Barto
Abstract: We present Variable Influence Structure Analysis, or VISA, an algorithm that performs hierarchical decomposition of factored Markov decision processes. VISA uses a dynamic Bayesian network model of actions, and constructs a causal graph that captures relationships between state variables. In tasks with sparse causal graphs VISA exploits structure by introducing activities that cause the values of state variables to change. The result is a hierarchy of activities that together represent a solution to the original task. VISA performs state abstraction for each activity by ignoring irrelevant state variables and lower-level activities. In addition, we describe an algorithm for constructing compact models of the activities introduced. State abstraction and compact activity models enable VISA to apply efficient algorithms to solve the stand-alone subtask associated with each activity. Experimental results show that the decomposition introduced by VISA can significantly accelerate construction of an optimal, or near-optimal, policy. Keywords: Markov decision processes, hierarchical decomposition, state abstraction
2 0.11433259 74 jmlr-2006-Point-Based Value Iteration for Continuous POMDPs
Author: Josep M. Porta, Nikos Vlassis, Matthijs T.J. Spaan, Pascal Poupart
Abstract: We propose a novel approach to optimize Partially Observable Markov Decisions Processes (POMDPs) defined on continuous spaces. To date, most algorithms for model-based POMDPs are restricted to discrete states, actions, and observations, but many real-world problems such as, for instance, robot navigation, are naturally defined on continuous spaces. In this work, we demonstrate that the value function for continuous POMDPs is convex in the beliefs over continuous state spaces, and piecewise-linear convex for the particular case of discrete observations and actions but still continuous states. We also demonstrate that continuous Bellman backups are contracting and isotonic ensuring the monotonic convergence of value-iteration algorithms. Relying on those properties, we extend the P ERSEUS algorithm, originally developed for discrete POMDPs, to work in continuous state spaces by representing the observation, transition, and reward models using Gaussian mixtures, and the beliefs using Gaussian mixtures or particle sets. With these representations, the integrals that appear in the Bellman backup can be computed in closed form and, therefore, the algorithm is computationally feasible. Finally, we further extend P ERSEUS to deal with continuous action and observation sets by designing effective sampling approaches. Keywords: planning under uncertainty, partially observable Markov decision processes, continuous state space, continuous action space, continuous observation space, point-based value iteration
3 0.10770007 20 jmlr-2006-Collaborative Multiagent Reinforcement Learning by Payoff Propagation
Author: Jelle R. Kok, Nikos Vlassis
Abstract: In this article we describe a set of scalable techniques for learning the behavior of a group of agents in a collaborative multiagent setting. As a basis we use the framework of coordination graphs of Guestrin, Koller, and Parr (2002a) which exploits the dependencies between agents to decompose the global payoff function into a sum of local terms. First, we deal with the single-state case and describe a payoff propagation algorithm that computes the individual actions that approximately maximize the global payoff function. The method can be viewed as the decision-making analogue of belief propagation in Bayesian networks. Second, we focus on learning the behavior of the agents in sequential decision-making tasks. We introduce different model-free reinforcementlearning techniques, unitedly called Sparse Cooperative Q-learning, which approximate the global action-value function based on the topology of a coordination graph, and perform updates using the contribution of the individual agents to the maximal global action value. The combined use of an edge-based decomposition of the action-value function and the payoff propagation algorithm for efficient action selection, result in an approach that scales only linearly in the problem size. We provide experimental evidence that our method outperforms related multiagent reinforcement-learning methods based on temporal differences. Keywords: collaborative multiagent system, coordination graph, reinforcement learning, Qlearning, belief propagation
4 0.090853095 10 jmlr-2006-Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems
Author: Eyal Even-Dar, Shie Mannor, Yishay Mansour
Abstract: We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that given n arms, it suffices to pull the arms a total of O (n/ε2 ) log(1/δ) times to find an ε-optimal arm with probability of at least 1 − δ. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise action elimination procedures in reinforcement learning algorithms. We describe a framework that is based on learning the confidence interval around the value function or the Q-function and eliminating actions that are not optimal (with high probability). We provide a model-based and a model-free variants of the elimination method. We further derive stopping conditions guaranteeing that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness over ε-greedy Q-learning.
5 0.075260736 75 jmlr-2006-Policy Gradient in Continuous Time
Author: Rémi Munos
Abstract: Policy search is a method for approximately solving an optimal control problem by performing a parametric optimization search in a given class of parameterized policies. In order to process a local optimization technique, such as a gradient method, we wish to evaluate the sensitivity of the performance measure with respect to the policy parameters, the so-called policy gradient. This paper is concerned with the estimation of the policy gradient for continuous-time, deterministic state dynamics, in a reinforcement learning framework, that is, when the decision maker does not have a model of the state dynamics. We show that usual likelihood ratio methods used in discrete-time, fail to proceed the gradient because they are subject to variance explosion when the discretization time-step decreases to 0. We describe an alternative approach based on the approximation of the pathwise derivative, which leads to a policy gradient estimate that converges almost surely to the true gradient when the timestep tends to 0. The underlying idea starts with the derivation of an explicit representation of the policy gradient using pathwise derivation. This derivation makes use of the knowledge of the state dynamics. Then, in order to estimate the gradient from the observable data only, we use a stochastic policy to discretize the continuous deterministic system into a stochastic discrete process, which enables to replace the unknown coefficients by quantities that solely depend on known data. We prove the almost sure convergence of this estimate to the true policy gradient when the discretization time-step goes to zero. The method is illustrated on two target problems, in discrete and continuous control spaces. Keywords: optimal control, reinforcement learning, policy search, sensitivity analysis, parametric optimization, gradient estimate, likelihood ratio method, pathwise derivation
6 0.06122734 4 jmlr-2006-A Linear Non-Gaussian Acyclic Model for Causal Discovery
7 0.059913527 7 jmlr-2006-A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events
8 0.042849205 30 jmlr-2006-Evolutionary Function Approximation for Reinforcement Learning
9 0.039124873 6 jmlr-2006-A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests
10 0.031381927 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation
11 0.030527156 50 jmlr-2006-Learning Recursive Control Programs from Problem Solving (Special Topic on Inductive Programming)
12 0.029916124 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models
13 0.027450193 21 jmlr-2006-Computational and Theoretical Analysis of Null Space and Orthogonal Linear Discriminant Analysis
14 0.026569055 5 jmlr-2006-A Robust Procedure For Gaussian Graphical Model Search From Microarray Data WithpLarger Thann
15 0.025222354 89 jmlr-2006-Structured Prediction, Dual Extragradient and Bregman Projections (Special Topic on Machine Learning and Optimization)
17 0.024558252 49 jmlr-2006-Learning Parts-Based Representations of Data
18 0.024533236 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification
19 0.024086531 25 jmlr-2006-Distance Patterns in Structural Similarity
topicId topicWeight
[(0, 0.138), (1, 0.127), (2, -0.252), (3, 0.069), (4, 0.183), (5, 0.0), (6, -0.02), (7, -0.029), (8, -0.029), (9, -0.039), (10, 0.097), (11, 0.082), (12, -0.094), (13, 0.039), (14, 0.057), (15, -0.063), (16, 0.073), (17, 0.057), (18, 0.002), (19, -0.196), (20, -0.035), (21, -0.031), (22, -0.025), (23, -0.084), (24, -0.004), (25, 0.129), (26, 0.077), (27, 0.062), (28, 0.062), (29, 0.033), (30, 0.083), (31, 0.11), (32, -0.006), (33, 0.026), (34, 0.265), (35, -0.015), (36, 0.098), (37, -0.054), (38, -0.023), (39, 0.076), (40, -0.062), (41, -0.02), (42, 0.013), (43, 0.082), (44, 0.03), (45, -0.044), (46, 0.052), (47, 0.134), (48, -0.254), (49, 0.174)]
simIndex simValue paperId paperTitle
same-paper 1 0.96940595 19 jmlr-2006-Causal Graph Based Decomposition of Factored MDPs
Author: Anders Jonsson, Andrew Barto
Abstract: We present Variable Influence Structure Analysis, or VISA, an algorithm that performs hierarchical decomposition of factored Markov decision processes. VISA uses a dynamic Bayesian network model of actions, and constructs a causal graph that captures relationships between state variables. In tasks with sparse causal graphs VISA exploits structure by introducing activities that cause the values of state variables to change. The result is a hierarchy of activities that together represent a solution to the original task. VISA performs state abstraction for each activity by ignoring irrelevant state variables and lower-level activities. In addition, we describe an algorithm for constructing compact models of the activities introduced. State abstraction and compact activity models enable VISA to apply efficient algorithms to solve the stand-alone subtask associated with each activity. Experimental results show that the decomposition introduced by VISA can significantly accelerate construction of an optimal, or near-optimal, policy. Keywords: Markov decision processes, hierarchical decomposition, state abstraction
Author: Eyal Even-Dar, Shie Mannor, Yishay Mansour
Abstract: We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that given n arms, it suffices to pull the arms a total of O (n/ε2 ) log(1/δ) times to find an ε-optimal arm with probability of at least 1 − δ. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise action elimination procedures in reinforcement learning algorithms. We describe a framework that is based on learning the confidence interval around the value function or the Q-function and eliminating actions that are not optimal (with high probability). We provide a model-based and a model-free variants of the elimination method. We further derive stopping conditions guaranteeing that the learned policy is approximately optimal with high probability. Simulations demonstrate a considerable speedup and added robustness over ε-greedy Q-learning.
3 0.48331678 4 jmlr-2006-A Linear Non-Gaussian Acyclic Model for Causal Discovery
Author: Shohei Shimizu, Patrik O. Hoyer, Aapo Hyvärinen, Antti Kerminen
Abstract: In recent years, several methods have been proposed for the discovery of causal structure from non-experimental data. Such methods make various assumptions on the data generating process to facilitate its identification from purely observational data. Continuing this line of research, we show how to discover the complete causal structure of continuous-valued data, under the assumptions that (a) the data generating process is linear, (b) there are no unobserved confounders, and (c) disturbance variables have non-Gaussian distributions of non-zero variances. The solution relies on the use of the statistical method known as independent component analysis, and does not require any pre-specified time-ordering of the variables. We provide a complete Matlab package for performing this LiNGAM analysis (short for Linear Non-Gaussian Acyclic Model), and demonstrate the effectiveness of the method using artificially generated data and real-world data. Keywords: independent component analysis, non-Gaussianity, causal discovery, directed acyclic graph, non-experimental data
4 0.4544726 74 jmlr-2006-Point-Based Value Iteration for Continuous POMDPs
Author: Josep M. Porta, Nikos Vlassis, Matthijs T.J. Spaan, Pascal Poupart
Abstract: We propose a novel approach to optimize Partially Observable Markov Decisions Processes (POMDPs) defined on continuous spaces. To date, most algorithms for model-based POMDPs are restricted to discrete states, actions, and observations, but many real-world problems such as, for instance, robot navigation, are naturally defined on continuous spaces. In this work, we demonstrate that the value function for continuous POMDPs is convex in the beliefs over continuous state spaces, and piecewise-linear convex for the particular case of discrete observations and actions but still continuous states. We also demonstrate that continuous Bellman backups are contracting and isotonic ensuring the monotonic convergence of value-iteration algorithms. Relying on those properties, we extend the P ERSEUS algorithm, originally developed for discrete POMDPs, to work in continuous state spaces by representing the observation, transition, and reward models using Gaussian mixtures, and the beliefs using Gaussian mixtures or particle sets. With these representations, the integrals that appear in the Bellman backup can be computed in closed form and, therefore, the algorithm is computationally feasible. Finally, we further extend P ERSEUS to deal with continuous action and observation sets by designing effective sampling approaches. Keywords: planning under uncertainty, partially observable Markov decision processes, continuous state space, continuous action space, continuous observation space, point-based value iteration
5 0.44174963 20 jmlr-2006-Collaborative Multiagent Reinforcement Learning by Payoff Propagation
Author: Jelle R. Kok, Nikos Vlassis
Abstract: In this article we describe a set of scalable techniques for learning the behavior of a group of agents in a collaborative multiagent setting. As a basis we use the framework of coordination graphs of Guestrin, Koller, and Parr (2002a) which exploits the dependencies between agents to decompose the global payoff function into a sum of local terms. First, we deal with the single-state case and describe a payoff propagation algorithm that computes the individual actions that approximately maximize the global payoff function. The method can be viewed as the decision-making analogue of belief propagation in Bayesian networks. Second, we focus on learning the behavior of the agents in sequential decision-making tasks. We introduce different model-free reinforcementlearning techniques, unitedly called Sparse Cooperative Q-learning, which approximate the global action-value function based on the topology of a coordination graph, and perform updates using the contribution of the individual agents to the maximal global action value. The combined use of an edge-based decomposition of the action-value function and the payoff propagation algorithm for efficient action selection, result in an approach that scales only linearly in the problem size. We provide experimental evidence that our method outperforms related multiagent reinforcement-learning methods based on temporal differences. Keywords: collaborative multiagent system, coordination graph, reinforcement learning, Qlearning, belief propagation
6 0.29215211 7 jmlr-2006-A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events
7 0.23929963 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models
8 0.23101953 21 jmlr-2006-Computational and Theoretical Analysis of Null Space and Orthogonal Linear Discriminant Analysis
9 0.20511205 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization
10 0.17669533 75 jmlr-2006-Policy Gradient in Continuous Time
11 0.17461953 50 jmlr-2006-Learning Recursive Control Programs from Problem Solving (Special Topic on Inductive Programming)
12 0.16889985 6 jmlr-2006-A Scoring Function for Learning Bayesian Networks based on Mutual Information and Conditional Independence Tests
14 0.15878566 15 jmlr-2006-Bayesian Network Learning with Parameter Constraints (Special Topic on Machine Learning and Optimization)
15 0.14309564 64 jmlr-2006-Noisy-OR Component Analysis and its Application to Link Analysis
17 0.13873082 80 jmlr-2006-Segmental Hidden Markov Models with Random Effects for Waveform Modeling
18 0.13216393 71 jmlr-2006-Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis
topicId topicWeight
[(8, 0.012), (21, 0.012), (36, 0.063), (45, 0.029), (48, 0.418), (49, 0.027), (50, 0.056), (61, 0.011), (63, 0.022), (67, 0.044), (68, 0.016), (76, 0.02), (78, 0.038), (79, 0.023), (81, 0.029), (90, 0.024), (91, 0.03), (96, 0.039)]
simIndex simValue paperId paperTitle
same-paper 1 0.7745353 19 jmlr-2006-Causal Graph Based Decomposition of Factored MDPs
Author: Anders Jonsson, Andrew Barto
Abstract: We present Variable Influence Structure Analysis, or VISA, an algorithm that performs hierarchical decomposition of factored Markov decision processes. VISA uses a dynamic Bayesian network model of actions, and constructs a causal graph that captures relationships between state variables. In tasks with sparse causal graphs VISA exploits structure by introducing activities that cause the values of state variables to change. The result is a hierarchy of activities that together represent a solution to the original task. VISA performs state abstraction for each activity by ignoring irrelevant state variables and lower-level activities. In addition, we describe an algorithm for constructing compact models of the activities introduced. State abstraction and compact activity models enable VISA to apply efficient algorithms to solve the stand-alone subtask associated with each activity. Experimental results show that the decomposition introduced by VISA can significantly accelerate construction of an optimal, or near-optimal, policy. Keywords: Markov decision processes, hierarchical decomposition, state abstraction
2 0.25854206 24 jmlr-2006-Consistency of Multiclass Empirical Risk Minimization Methods Based on Convex Loss
Author: Di-Rong Chen, Tao Sun
Abstract: The consistency of classification algorithm plays a central role in statistical learning theory. A consistent algorithm guarantees us that taking more samples essentially suffices to roughly reconstruct the unknown distribution. We consider the consistency of ERM scheme over classes of combinations of very simple rules (base classifiers) in multiclass classification. Our approach is, under some mild conditions, to establish a quantitative relationship between classification errors and convex risks. In comparison with the related previous work, the feature of our result is that the conditions are mainly expressed in terms of the differences between some values of the convex function. Keywords: multiclass classification, classifier, consistency, empirical risk minimization, constrained comparison method, Tsybakov noise condition
3 0.25102714 74 jmlr-2006-Point-Based Value Iteration for Continuous POMDPs
Author: Josep M. Porta, Nikos Vlassis, Matthijs T.J. Spaan, Pascal Poupart
Abstract: We propose a novel approach to optimize Partially Observable Markov Decisions Processes (POMDPs) defined on continuous spaces. To date, most algorithms for model-based POMDPs are restricted to discrete states, actions, and observations, but many real-world problems such as, for instance, robot navigation, are naturally defined on continuous spaces. In this work, we demonstrate that the value function for continuous POMDPs is convex in the beliefs over continuous state spaces, and piecewise-linear convex for the particular case of discrete observations and actions but still continuous states. We also demonstrate that continuous Bellman backups are contracting and isotonic ensuring the monotonic convergence of value-iteration algorithms. Relying on those properties, we extend the P ERSEUS algorithm, originally developed for discrete POMDPs, to work in continuous state spaces by representing the observation, transition, and reward models using Gaussian mixtures, and the beliefs using Gaussian mixtures or particle sets. With these representations, the integrals that appear in the Bellman backup can be computed in closed form and, therefore, the algorithm is computationally feasible. Finally, we further extend P ERSEUS to deal with continuous action and observation sets by designing effective sampling approaches. Keywords: planning under uncertainty, partially observable Markov decision processes, continuous state space, continuous action space, continuous observation space, point-based value iteration
4 0.24351276 75 jmlr-2006-Policy Gradient in Continuous Time
Author: Rémi Munos
Abstract: Policy search is a method for approximately solving an optimal control problem by performing a parametric optimization search in a given class of parameterized policies. In order to apply a local optimization technique, such as a gradient method, we wish to evaluate the sensitivity of the performance measure with respect to the policy parameters, the so-called policy gradient. This paper is concerned with the estimation of the policy gradient for continuous-time, deterministic state dynamics, in a reinforcement learning framework, that is, when the decision maker does not have a model of the state dynamics. We show that the usual likelihood ratio methods used in discrete time fail to estimate the gradient because they are subject to variance explosion when the discretization time-step decreases to 0. We describe an alternative approach based on approximation of the pathwise derivative, which leads to a policy gradient estimate that converges almost surely to the true gradient when the time-step tends to 0. The underlying idea starts with the derivation of an explicit representation of the policy gradient using pathwise derivation; this derivation makes use of knowledge of the state dynamics. Then, in order to estimate the gradient from the observable data only, we use a stochastic policy to discretize the continuous deterministic system into a stochastic discrete-time process, which enables us to replace the unknown coefficients by quantities that depend solely on known data. We prove the almost sure convergence of this estimate to the true policy gradient as the discretization time-step goes to zero. The method is illustrated on two target problems, in discrete and continuous control spaces. Keywords: optimal control, reinforcement learning, policy search, sensitivity analysis, parametric optimization, gradient estimate, likelihood ratio method, pathwise derivation
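For reference, the pathwise-derivative representation that this abstract alludes to can be stated compactly: with deterministic dynamics dx_t/dt = f_α(x_t), terminal reward r(x_T), and continuously differentiable r, the state sensitivity z_t = ∇_α x_t obeys a linear ODE along the trajectory and yields the gradient at the horizon T (a summary of the paper's setup in its own notation):

```latex
% State sensitivity and pathwise policy gradient (terminal-reward case).
\begin{align*}
  z_t &:= \nabla_\alpha x_t, \qquad z_0 = 0, \\
  \frac{dz_t}{dt} &= \nabla_\alpha f_\alpha(x_t) + \nabla_x f_\alpha(x_t)\, z_t, \\
  \nabla_\alpha V(\alpha) &= \nabla_x r(x_T)\, z_T .
\end{align*}
```

Here z_t is a d × m matrix (d state dimensions, m policy parameters), so ∇_α V(α) is a row vector of length m; the reinforcement-learning contribution of the paper is estimating z_T without knowing ∇_α f or ∇_x f.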
5 0.24341744 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix
Author: Mikio L. Braun
Abstract: The eigenvalues of the kernel matrix play an important role in a number of kernel methods, in particular, in kernel principal component analysis. It is well known that the eigenvalues of the kernel matrix converge as the number of samples tends to infinity. We derive probabilistic finite sample size bounds on the approximation error of individual eigenvalues which have the important property that the bounds scale with the eigenvalue under consideration, reflecting the actual behavior of the approximation errors as predicted by asymptotic results and observed in numerical simulations. Such scaling bounds have so far only been known for tail sums of eigenvalues. Asymptotically, the bounds presented here have a slower than stochastic rate, but the number of sample points necessary to make this disadvantage noticeable is often unrealistically large. Therefore, under practical conditions, and for all but the largest few eigenvalues, the bounds presented here form a significant improvement over existing non-scaling bounds. Keywords: kernel matrix, eigenvalues, relative perturbation bounds
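As a quick numerical illustration of the convergence the abstract refers to (a generic check, not the paper's scaling bounds), one can watch the leading eigenvalues of the normalized kernel matrix K/n stabilize as the sample size grows; the Gaussian kernel, bandwidth, and data distribution below are arbitrary assumptions.

```python
# Sketch: leading eigenvalues of the (1/n)-normalized kernel matrix stabilize
# as the sample size grows, approximating the integral-operator eigenvalues.
# Kernel, bandwidth, and data distribution are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kernel_matrix(X, bandwidth=1.0):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * bandwidth**2))

for n in [100, 400, 1600]:
    X = rng.normal(size=(n, 1))                 # samples from a fixed distribution
    K = gaussian_kernel_matrix(X)
    eigvals = np.linalg.eigvalsh(K / n)[::-1]   # descending order
    print(n, np.round(eigvals[:5], 4))          # top-5 approximate operator eigenvalues
```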
7 0.24222523 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation
8 0.23914985 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification
9 0.2384991 48 jmlr-2006-Learning Minimum Volume Sets
10 0.23756672 66 jmlr-2006-On Model Selection Consistency of Lasso
11 0.23525603 53 jmlr-2006-Learning a Hidden Hypergraph
12 0.23485139 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data
13 0.234814 20 jmlr-2006-Collaborative Multiagent Reinforcement Learning by Payoff Propagation
14 0.23469074 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation
15 0.23458819 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models
16 0.23231849 46 jmlr-2006-Learning Factor Graphs in Polynomial Time and Sample Complexity
17 0.23203719 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting
18 0.23129359 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation
20 0.23041753 87 jmlr-2006-Stochastic Complexities of Gaussian Mixtures in Variational Bayesian Approximation