jmlr jmlr2008 jmlr2008-36 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Rémi Munos, Csaba Szepesvári
Abstract: In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrary good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted L p -norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is “aligned” with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings. Keywords: fitted value iteration, discounted Markovian decision processes, generative model, reinforcement learning, supervised learning, regression, Pollard’s inequality, statistical learning theory, optimal control
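For reference, in the notation used by the extracted sentences below (F denotes the function class, T the Bellman optimality operator, and µ a distribution over states), the two quantities emphasized in the abstract can be sketched as

$$\|f\|_{p,\mu}^p = \int_{\mathcal X} |f(x)|^p\,\mu(dx), \qquad d_{p,\mu}(T\mathcal F, \mathcal F) = \sup_{g \in \mathcal F}\, \inf_{f \in \mathcal F} \|f - Tg\|_{p,\mu},$$

where the latter is the inherent Bellman error (residual) of F; this follows the form given in sentence 7 of the list below.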
Reference: text
sentIndex sentText sentNum sentScore
1 The approach in the above-mentioned works is to learn an approximation to the optimal value function, which assigns to each state the best possible expected long-term cumulative reward resulting from starting the control from that state. [sent-25, score-0.159]
2 The problem is to find a policy (or controller) that maximizes the expectation of the infinite-horizon, discounted sum of rewards. [sent-57, score-0.248]
3 We investigate two versions of the basic algorithm: In the multi-sample variant a fresh sample set is generated in each iteration, while in the single-sample variant the same sample set is used throughout all the iterations (an illustrative sketch of both variants is given after the sentence list below). [sent-65, score-0.128]
4 We will compare our bounds to those available in supervised learning (regression), where analogous bounds have two terms: one bounding the bias of the algorithm, the other bounding the variance, or estimation error. [sent-70, score-0.214]
5 The term bounding the bias decreases when the approximation power of the function class is increased (hence this term is occasionally called the approximation error term). [sent-71, score-0.138]
6 Although our bounds are similar to bounds of supervised learning, there are some notable differences. [sent-73, score-0.134]
7 In fact, our bounds use a different characterization of the approximation power of the function class F , which we call the inherent Bellman error of F : $d(T\mathcal{F}, \mathcal{F}) = \sup_{g \in \mathcal{F}} \inf_{f \in \mathcal{F}} \| f - T g \|$. [sent-80, score-0.245]
8 In the above-mentioned counterexamples the inherent Bellman error of the chosen function space is infinite and so the bound (correctly) indicates the possibility of divergence. [sent-83, score-0.147]
9 The bounds on the variance are closer to their regression counterparts: Just like in regression our variance bounds depend on the capacity of the function space used and decay polynomially with the number of samples. [sent-84, score-0.134]
10 Nevertheless, the bounds still indicate that, in the limit as the number of samples grows to infinity, the maximal error of the procedure converges to a finite positive number. [sent-87, score-0.143]
11 As already hinted above, our bounds display the usual bias-variance tradeoff: In order to keep both the approximation and estimation errors small, the number of samples and the power of the function class have to be increased simultaneously. [sent-89, score-0.147]
12 The precise meaning of this will be given in Section 5, here we note in passing that this condition holds trivially in every finite MDP, and also, more generally, if the MDP’s transition kernel possesses a bounded density. [sent-94, score-0.134]
13 Next, we develop finite-sample bounds for the error committed in a single iteration (Section 4). [sent-111, score-0.128]
14 We extend these results to the problem of obtaining a good policy in Section 6, followed by a construction that allows one to achieve asymptotic consistency when the unknown MDP is smooth with an unknown smoothness factor (Section 7). [sent-113, score-0.225]
15 In words, we say that when action at is executed from state Xt = x the process makes a transition from x to the next state Xt+1 and a reward, Rt , is incurred. [sent-135, score-0.194]
16 A policy is a sequence of functions that maps possible histories to probability distributions over the space of actions. [sent-143, score-0.189]
17 Hence if the space of histories at time step t is denoted by Ht then a policy π is a sequence π0 , π1 , . [sent-144, score-0.189]
18 A policy is called stationary if πt depends only on the last state visited. [sent-155, score-0.267]
19 A policy is called deterministic if for any history x0 , a0 , . [sent-163, score-0.189]
20 Hence, any deterministic stationary policy can be identified by some mapping from the state space to the action space and so in the following, at the price of abusing the notation and the terminology slightly, we will call such mappings policies, too. [sent-170, score-0.306]
21 The goal is to find a policy π that maximizes the expected total discounted reward given any initial state. [sent-171, score-0.321]
22 Under this criterion the value of a policy π and a state x ∈ X is given by $V^\pi(x) = E\left[\sum_{t=0}^{\infty} \gamma^t R_t^\pi \mid X_0 = x\right]$, where $R_t^\pi$ is the reward incurred at time t when executing policy π. [sent-172, score-0.503]
23 The optimal expected total discounted reward when the process is started from state x shall be denoted by V ∗ (x); V ∗ is called the optimal value function and is defined by V ∗ (x) = supπ V π (x). [sent-173, score-0.217]
24 A policy π is called optimal if it attains the optimal values for any state x ∈ X , that is if V π (x) = V ∗ (x) for all x ∈ X . [sent-174, score-0.241]
25 We also let Q∗ (x, a) denote the long-term total expected discounted reward when the process is started from state x, the first executed action is a and it is assumed that after the first step an optimal policy is followed. [sent-175, score-0.412]
26 Since by assumption the action space is finite, the rewards are bounded, and we assume discounting, the existence of deterministic stationary optimal policies is guaranteed (Bertsekas and Shreve, 1978). [sent-176, score-0.148]
27 A deterministic stationary policy π : X → A defines the transition probability kernel P π according to Pπ (dy|x) = P(dy|x, π(x)). [sent-187, score-0.266]
28 In words, (PπV )(x) is the expected value of V after following π for a single time-step when starting from x, and µPπ is the distribution of states if the system is started from X0 ∼ µ and policy π is followed for a single time-step. [sent-190, score-0.189]
29 Hence, µPπ1 Pπ2 is the distribution of states if the system is started from X0 ∼ µ, policy π1 is followed for the first step and then policy π2 is followed for the second step. [sent-192, score-0.378]
30 We say that a (deterministic stationary) policy π is greedy w.r.t. [sent-194, score-0.221]
31 a function V ∈ B(X ) if, for all x ∈ X , $\pi(x) \in \arg\max_{a \in A}\left\{ r(x,a) + \gamma \int V(y)\, P(dy|x,a) \right\}$, where $r(x,a) = \int z\, S(dz|x,a)$ is the expected reward of executing action a in state x. [sent-197, score-0.164]
32 Similarly, to any stationary deterministic policy π there corresponds an operator T π : B(X ) → B(X ) defined by $(T^\pi V)(x) = r(x, \pi(x)) + \gamma (P^\pi V)(x)$. [sent-206, score-0.259]
33 The contraction arguments also show that if |r(x, a)| is bounded by Rmax > 0 then V ∗ is bounded by Rmax /(1 − γ) and if V0 ∈ B(X ; Rmax /(1 − γ)) then the same holds for Vk , too. [sent-211, score-0.154]
34 The reward kernel S is such that the immediate reward function r is a bounded measurable function with bound Rmax . [sent-214, score-0.269]
35 There are two components of this error: The approximation error caused by projecting the iterates into the function space F and the estimation error caused by using a finite, random sample. [sent-269, score-0.143]
36 A simple idea developed there is to bound the estimation error of V by the worst-case estimation error over F : $\frac{1}{N}\sum_{i=1}^{N} |V(X_i) - g(X_i)|^p - \|V - g\|_{p,\mu}^p \;\le\; \sup_{f \in \mathcal{F}} \left[ \frac{1}{N}\sum_{i=1}^{N} |f(X_i) - g(X_i)|^p - \|f - g\|_{p,\mu}^p \right]$. [sent-308, score-0.143]
37 Another route to get a useful bound on the number of samples is to derive an upper bound on the metric entropy that grows with the number of samples at a sublinear rate. [sent-347, score-0.2]
38 The problem is easily resolved in the multi-sample variant of the algorithm by noting that the samples used in calculating Vk+1 are independent of the samples used to calculate Vk . [sent-408, score-0.144]
39 Note that the sample-size bounds in this lemma are similar to those of Lemma 1, except that N now depends on the metric entropy of F T − and M depends on the metric entropy of F . [sent-424, score-0.163]
40 We are interested in bounding the loss due to using policy πk instead of an optimal one, where the loss is measured by a weighted p-norm: $L_k = \| V^* - V^{\pi_k} \|_{p,\rho}$. [sent-455, score-0.229]
41 Since the previous section’s bounds are given in terms of weighted p-norms, it is natural to develop weighted p-norm bounds for the whole algorithm. [sent-466, score-0.134]
42 The sketch of this argument is as follows: Since we are interested in developing a bound on the performance of the greedy policy w.r.t. the final value function estimate. [sent-470, score-0.261]
43 Here V plays the role of the final value function estimate, and π is a greedy policy w.r.t. V. [sent-474, score-0.221]
44 Now, exploiting that the operator Pπ is linear for any π, iterating these bounds yields upper and lower bounds on V ∗ − VK as a function of {εk }k . [sent-486, score-0.178]
45 Since $\|P^{\pi_k}\|_\infty \le 1$, starting from the pointwise bounds one recovers the well-known supremum-norm bounds by just taking the supremum of the bounds’ two sides. [sent-499, score-0.13]
46 Hence, the pointwise bounding technique yields bounds that are as tight as those of the previous supremum-norm bounding technique. [sent-500, score-0.184]
47 However, since in the algorithm only the weighted p-norm errors are controlled, instead of taking the pointwise supremum, we integrate the pointwise bounds w.r.t. the distribution ρ. [sent-501, score-0.141]
48 the density of P(·|x, a) w.r.t. µ is bounded uniformly with bound Cµ : $C_\mu \stackrel{\mathrm{def}}{=} \sup_{x \in X, a \in A} \left\| \frac{dP(\cdot|x,a)}{d\mu} \right\|_\infty < +\infty$. [sent-514, score-0.187]
49 Assumption A1 can be written in the form P(·|x, a) ≤ Cµ µ(·), an assumption that was introduced by Munos (2003) in a finite MDP context for the analysis of approximate policy iteration. [sent-515, score-0.189]
50 This holds since by definition for any distribution ν and policy π, νPπ ≤ Cµ µ. [sent-543, score-0.228]
51 One way to generalize the above definition to controlled systems and infinite state spaces is to identify yt with the future state distribution when the policies are selected to maximize the growth rate of $\|\hat{y}_t\|_\infty$. [sent-556, score-0.193]
52 It states that with high probability the final performance of the policy found by the algorithm can be made as close to a constant times the inherent Bellman error of the function space F as desired by selecting a sufficiently high number of samples. [sent-567, score-0.266]
53 The first term bounds the approximation error, the second arises due to the finite number of iterations, while the last two terms bound the estimation error. [sent-592, score-0.141]
54 This form of the bound allows us to reason about the likely best choice of N and M given a fixed budget of n = K × N × M samples (or n = N × M samples per iteration). [sent-593, score-0.132]
55 It follows that, given a fixed budget of n samples and provided that K is sufficiently large, the bound for the single-sample variant is better than the one for the multi-sample variant. [sent-601, score-0.138]
56 Another way to use the above bound is to make comparisons with the rates available in nonparametric regression: First, notice that the approximation error of F is defined as the inherent Bellman error of F instead of using an external reference class. [sent-608, score-0.181]
57 Randomized Policies. The previous result shows that by making the inherent Bellman error of the function space small enough, we can ensure close-to-optimal performance if one uses a policy greedy w.r.t. the final value function estimate VK . [sent-626, score-0.298]
58 However, the computation of such a greedy policy requires the evaluation of some expectations, whose exact values are often difficult to compute. [sent-630, score-0.221]
59 In this section we show that by computations analogous to those used in obtaining the iterates we can compute a randomized near-optimal policy based on VK . [sent-631, score-0.238]
60 Let the policy $\pi_K^{\alpha,\lambda} : X \to A$ be defined by $\pi_K^{\alpha,\lambda}(x) = \arg\max_{a \in A} Q_M(x, a)$. [sent-642, score-0.189]
61 The smoothness of the transition probabilities and rewards is defined w.r.t. the state variable. [sent-663, score-0.127]
62 If the error terms, εt , are bounded in supremum norm, then a straightforward analysis shows that asymptotically, the worst-case performance-loss for the policy greedy w.r.t. the final iterate is bounded by $\frac{2\gamma}{(1-\gamma)^2} \sup_{t \ge 1} \|\varepsilon_t\|_\infty$. [sent-713, score-0.321]
63 When $V_{t+1}$ is the best approximation of $T V_t$ in F then $\sup_{t \ge 1} \|\varepsilon_t\|_\infty$ can be upper bounded by the inherent Bellman error $d_\infty(T\mathcal{F}, \mathcal{F}) = \sup_{f \in \mathcal{F}} \inf_{g \in \mathcal{F}} \|g - T f\|_\infty$ and we get the loss-bound $\frac{2\gamma}{(1-\gamma)^2}\, d_\infty(T\mathcal{F}, \mathcal{F})$. [sent-719, score-0.198]
64 A different analysis, originally proposed by Gordon (1995) and Tsitsiklis and Van Roy (1996), goes by assuming that the iterates satisfy Vt+1 = ΠTVt , where Π is an operator that maps bounded functions to the function space F . [sent-727, score-0.137]
65 In this case, the loss of using the policy greedy w.r.t. the resulting fixed point scales with $\frac{4\gamma}{(1-\gamma)^2}$ times the corresponding approximation error. [sent-732, score-0.221]
66 Since the optimal policy is unknown as well, it is not immediately clear that having to approximate only a single function (one that depends on the unknown MDP) should be an advantage. [sent-763, score-0.189]
67 For finite state-space MDPs, Munos (2003, 2005) considered planning scenarios with known dynamics, analyzing the stability of both approximate policy iteration and value iteration with weighted L2 (resp. Lp ) norms. [sent-780, score-0.251]
68 Using techniques similar to those developed here, we have recently proved results for the learning scenario when only a single trajectory of some fixed behavior policy is known (Antos et al. [sent-783, score-0.189]
69 The error bounds come in the form of performance differences between a pair of greedy policies: One of the policies from the pair is greedy w.r.t. [sent-789, score-0.204]
70 The algorithm considered by Kakade and Langford is called conservative policy iteration (CPI). [sent-808, score-0.22]
71 The general version searches in a fixed policy space, Π: in each step, an optimizer picks a policy that maximizes the average of the empirical advantages of the previous policy at a number of states (basepoints) sampled from some distribution. [sent-810, score-0.567]
72 The policy picked this way is mixed into the previous policy to prevent performance drops due to drastic changes, hence the name of the algorithm. [sent-812, score-0.408]
73 Kakade (2003) gives bounds on the loss of using the policy returned by this procedure relative to using some other policy π (e.g. [sent-817, score-0.445]
74 , a near-optimal policy) as a function of the total variation distance between ν, the distribution used to sample the basepoints (this distribution is provided by the user), and the discounted future-state distribution underlying π when π is started from a random state sampled from ν (dπ,ν ). [sent-819, score-0.161]
75 A result in Kakade and Langford (2002) bounds the expected performance loss under ν as a function of the imprecision of the optimizer and the Radon-Nikodym derivative of dπ∗ ,ν and dπ0 ,ν , where π0 is the policy returned by the algorithm. [sent-823, score-0.256]
76 However this result applies only to the case when the policy set is unrestricted, and hence the result is limited to finite MDPs. [sent-824, score-0.219]
77 This is because the problem of estimating the value function of a policy in a trivial finite-horizon problem with a single time step is equivalent to regression. [sent-827, score-0.189]
78 Further, the MDPs within the class are assumed to be uniformly smooth: for any MDP in the class the Lipschitz constant of the reward function of the MDP is bounded by an appropriate constant and the same holds for the Lipschitz constant of the density function. [sent-839, score-0.156]
79 Then, any algorithm that is guaranteed to return an ε-optimal approximation to the optimal value function must query (sample) the reward function and the transition probabilities at least $\Omega(1/\varepsilon^d)$ times, for some MDP within the class considered. [sent-841, score-0.158]
80 Here the first argument of max represents the total future reward given that the product is not replaced, while the second argument gives the total future reward provided that the product is replaced. [sent-878, score-0.146]
81 The optimal policy is π∗ (x) = K if x ∈ [0, x̄], and π∗ (x) = R if x > x̄. [sent-880, score-0.189]
82 For this we fix an upper bound for the states, xmax = 10 x̄, and modify the problem definition such that if the next state y happens to be outside of the domain [0, xmax ] then the product is replaced immediately, and a new state is drawn as if action R were chosen in the previous time step. [sent-891, score-0.239]
83 The transition density functions p(·|x, a) are bounded by β, thus Assumption A1 holds with Cµ = βxmax = 5. [sent-894, score-0.134]
84 From Figure 3 we observe that when the degree l of the polynomials increases, the error first decreases because of the decrease of the inherent approximation error, but eventually increases because of overfitting. [sent-915, score-0.154]
85 Figure 4 (error vs. state): Approximation errors for the multi-sample (left figure) and the single-sample (right) variants of sampling based FVI. [sent-933, score-0.164]
86 However, for the multi-sample variant M = 10, while for the single-sample variant M = 100, making the total number of samples used equal in the two cases. [sent-936, score-0.15]
87 The bounds scale with the inherent Bellman error of the function space used in the regression step, and the stochastic stability properties of the MDP. [sent-943, score-0.144]
88 It is an open question if the finiteness of the inherent Bellman error is necessary for the stability of FVI, but the counterexamples discussed in the introduction suggest that the inherent Bellman residual of the function space should indeed play a crucial role in the final performance of FVI. [sent-944, score-0.154]
89 Since the only way to learn about the optimal policy is by running the algorithm, one idea is to change the sampling distribution by moving it closer to the future-state distribution of the most recent policy. [sent-959, score-0.189]
90 First, observe that (12) holds due to the choice of $\hat{V}$: since $\|\hat{V} - V\|_{p,\hat\mu} \le \|f - V\|_{p,\hat\mu}$ holds for all functions f from F , the same inequality holds for $f^* \in \mathcal{F}$, too. [sent-989, score-0.145]
91 Therefore the inequality $\sup_{f \in \mathcal{F}} \big[ \|f - TV\|_{p,\mu} - \|f - TV\|_{p,\hat\mu} \big] > Q$ (16) holds pointwise in Ω, too, and hence $P(Q > \varepsilon) \le P\big( \sup_{f \in \mathcal{F}} \big[ \|f - TV\|_{p,\mu} - \|f - TV\|_{p,\hat\mu} \big] > \varepsilon \big)$. [sent-998, score-0.22]
92 Up to (16) the two proofs proceed in an identical way; however, from (16) we continue by concluding that $\sup_{g \in \mathcal{F}} \sup_{f \in \mathcal{F}} \big[ \|f - Tg\|_{p,\mu} - \|f - Tg\|_{p,\hat\mu} \big] > Q$ holds pointwise in Ω. [sent-1044, score-0.162]
93 The proof is concluded by noting that the covering numbers of F + can be bounded in terms of the covering numbers of F using the arguments presented after the Lemma at the end of Section 4. [sent-1050, score-0.153]
94 First, note that iteration (7) or (6) may be written Vk+1 = TVk − εk where εk , defined by εk = TVk −Vk+1 , is the approximation error of the Bellman operator applied to Vk due to sampling. [sent-1056, score-0.139]
95 The proof is done in two steps: we first prove a statement that gives pointwise bounds (i.e., bounds that hold at every state). [sent-1057, score-0.129]
96 Note that if $\|\varepsilon_k\|_\infty \le \varepsilon$ then, letting p → ∞, we get back the well-known, unimprovable supremum-norm error bound $\limsup_{K \to \infty} \|V^* - V^{\pi_K}\|_\infty \le \frac{2\gamma}{(1-\gamma)^2}\, \varepsilon$ for approximate value iteration (Bertsekas and Tsitsiklis, 1996). [sent-1093, score-0.171]
97 Thus, if the bound (24) holds for any ρ then choosing ρ to be a Dirac at each state proves (23). [sent-1096, score-0.131]
98 By Lebesgue’s monotone convergence theorem, $g_n(y) \stackrel{\mathrm{def}}{=} E[f_n(X, y)] \to E[f(X, y)] \;(= g(y))$. [sent-1132, score-0.147]
99 Proof of Theorem 3. We would like to prove that the policy defined in Section 6 gives close-to-optimal performance. [sent-1183, score-0.189]
100 Let $\pi_K^{\alpha,\lambda}$ be a policy that selects α-greedy actions. [sent-1189, score-0.189]
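The sentences above describe the core loop of sampling-based FVI (see in particular sentences 3, 31, 59 and 60): at every iteration, Monte Carlo estimates of the Bellman backup are computed at N sampled basepoints using M next-state samples per state-action pair drawn from the generative model, the resulting targets are regressed onto the function space F, and a (near-)greedy policy is extracted from the last iterate. The following Python sketch illustrates this scheme under stated assumptions; the generative-model interface (model.sample, model.sample_states) and the regression routine fit are hypothetical placeholders standing in for the paper's setting, not its actual implementation.

```python
# Illustrative sketch only: `model.sample`, `model.sample_states` and `fit` are
# hypothetical placeholders, not the paper's API.
import numpy as np


def sampled_bellman_backup(V, states, model, actions, gamma, M, rng):
    """Monte Carlo estimate of (TV)(x) at each basepoint x.

    For every basepoint X_i and action a, draw M next states Y ~ P(.|X_i, a) and
    rewards R from the generative model, average R + gamma * V(Y), and maximize
    over actions (the sampled Bellman optimality backup).
    """
    targets = np.empty(len(states))
    for i, x in enumerate(states):
        q_values = []
        for a in actions:
            next_states, rewards = model.sample(x, a, M, rng)  # generative model call (assumed)
            q_values.append(np.mean(rewards + gamma * V(next_states)))
        targets[i] = max(q_values)
    return targets


def fitted_value_iteration(model, actions, fit, gamma=0.95, K=20, N=200, M=10,
                           multi_sample=True, seed=0):
    """K iterations of sampling-based FVI; `fit` maps (states, targets) to a callable V."""
    rng = np.random.default_rng(seed)
    V = lambda xs: np.zeros(len(xs))              # V_0 = 0
    fixed_states = model.sample_states(N, rng)    # reused by the single-sample variant
    for _ in range(K):
        # multi-sample variant: fresh basepoints every iteration;
        # single-sample variant: the same basepoints throughout (a simplification).
        states = model.sample_states(N, rng) if multi_sample else fixed_states
        targets = sampled_bellman_backup(V, states, model, actions, gamma, M, rng)
        V = fit(states, targets)                  # regression step: projection onto F
    return V


def greedy_policy(V, model, actions, gamma, M, rng):
    """Near-greedy policy w.r.t. the final estimate V via Monte Carlo Q-value estimates."""
    def pi(x):
        q = []
        for a in actions:
            next_states, rewards = model.sample(x, a, M, rng)
            q.append(np.mean(rewards + gamma * V(next_states)))
        return actions[int(np.argmax(q))]
    return pi
```

Setting multi_sample=True draws fresh basepoints in every iteration (multi-sample variant), while multi_sample=False reuses the same basepoints throughout, which is a simplified stand-in for the single-sample variant discussed above.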
wordName wordTfidf (topN-words)
[('tv', 0.565), ('fvi', 0.292), ('vk', 0.282), ('vmax', 0.221), ('policy', 0.189), ('mdp', 0.168), ('tvk', 0.135), ('unos', 0.135), ('zepesv', 0.135), ('inite', 0.128), ('itted', 0.128), ('ari', 0.127), ('rxi', 0.121), ('bellman', 0.12), ('mdps', 0.114), ('ounds', 0.109), ('teration', 0.109), ('jxi', 0.1), ('vf', 0.1), ('tsitsiklis', 0.094), ('lip', 0.092), ('ime', 0.089), ('fn', 0.087), ('dy', 0.086), ('rmax', 0.076), ('reward', 0.073), ('kakade', 0.071), ('bounds', 0.067), ('tvmax', 0.064), ('def', 0.06), ('roy', 0.059), ('discounted', 0.059), ('munos', 0.054), ('variant', 0.052), ('state', 0.052), ('transition', 0.051), ('basepoints', 0.05), ('iterates', 0.049), ('szepesv', 0.048), ('inherent', 0.047), ('samples', 0.046), ('bounded', 0.044), ('operator', 0.044), ('policies', 0.043), ('polynomials', 0.043), ('sup', 0.043), ('concentrability', 0.043), ('covering', 0.042), ('bounding', 0.04), ('lemma', 0.04), ('rewards', 0.04), ('bound', 0.04), ('holds', 0.039), ('action', 0.039), ('measurable', 0.039), ('ak', 0.037), ('pointwise', 0.037), ('smoothness', 0.036), ('ft', 0.035), ('approximation', 0.034), ('bertsekas', 0.033), ('lipschitz', 0.033), ('shall', 0.033), ('xt', 0.033), ('greedy', 0.032), ('exponent', 0.032), ('iteration', 0.031), ('log', 0.031), ('hence', 0.03), ('horizon', 0.03), ('error', 0.03), ('counterexamples', 0.03), ('fix', 0.03), ('xi', 0.029), ('logvmax', 0.028), ('shreve', 0.028), ('xmax', 0.028), ('metric', 0.028), ('inequality', 0.028), ('pollard', 0.028), ('gy', 0.028), ('kth', 0.027), ('contraction', 0.027), ('sk', 0.026), ('stationary', 0.026), ('supremum', 0.026), ('rl', 0.026), ('proof', 0.025), ('langford', 0.025), ('markovian', 0.025), ('rt', 0.025), ('fresh', 0.024), ('gordon', 0.024), ('hungarian', 0.024), ('inf', 0.024), ('iterate', 0.024), ('yt', 0.023), ('tted', 0.023), ('hoeffding', 0.022), ('lp', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
Author: Rémi Munos, Csaba Szepesvári
Abstract: In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrary good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted L p -norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is “aligned” with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings. Keywords: fitted value iteration, discounted Markovian decision processes, generative model, reinforcement learning, supervised learning, regression, Pollard’s inequality, statistical learning theory, optimal control
2 0.23810366 95 jmlr-2008-Value Function Based Reinforcement Learning in Changing Markovian Environments
Author: Balázs Csanád Csáji, László Monostori
Abstract: The paper investigates the possibility of applying value function based reinforcement learning (RL) methods in cases when the environment may change over time. First, theorems are presented which show that the optimal value function of a discounted Markov decision process (MDP) Lipschitz continuously depends on the immediate-cost function and the transition-probability function. Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of (ε, δ)-MDPs is introduced, which is a generalization of MDPs and ε-MDPs. In this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must be bounded in the limit. Then, learning algorithms in changing environments are analyzed. A general relaxed convergence theorem for stochastic iterative algorithms is presented. We also demonstrate the results through three classical RL methods: asynchronous value iteration, Q-learning and temporal difference learning. Finally, some numerical experiments concerning changing environments are presented. Keywords: Markov decision processes, reinforcement learning, changing environments, (ε, δ)MDPs, value function bounds, stochastic iterative algorithms
3 0.14619491 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
Author: Andreas Maurer
Abstract: A method is introduced to learn and represent similarity with linear operators in kernel induced Hilbert spaces. Transferring error bounds for vector valued large-margin classifiers to the setting of Hilbert-Schmidt operators leads to dimension free bounds on a risk functional for linear representations and motivates a regularized objective functional. Minimization of this objective is effected by a simple technique of stochastic gradient descent. The resulting representations are tested on transfer problems in image processing, involving plane and spatial geometric invariants, handwritten characters and face recognition. Keywords: learning similarity, similarity, transfer learning
Author: Avraham Bab, Ronen I. Brafman
Abstract: Multi Agent Reinforcement Learning (MARL) has received continually growing attention in the past decade. Many algorithms that vary in their approaches to the different subtasks of MARL have been developed. However, the theoretical convergence results for these algorithms do not give a clue as to their practical performance nor supply insights to the dynamics of the learning process itself. This work is a comprehensive empirical study conducted on MGS, a simulation system developed for this purpose. It surveys the important algorithms in the field, demonstrates the strengths and weaknesses of the different approaches to MARL through application of FriendQ, OAL, WoLF, FoeQ, Rmax, and other algorithms to a variety of fully cooperative and fully competitive domains in self and heterogeneous play, and supplies an informal analysis of the resulting learning processes. The results can aid in the design of new learning algorithms, in matching existing algorithms to specific tasks, and may guide further research and formal analysis of the learning processes. Keywords: reinforcement learning, multi-agent reinforcement learning, stochastic games
Author: Yang-Bo He, Zhi Geng
Abstract: The causal discovery from data is important for various scientific investigations. Because we cannot distinguish the different directed acyclic graphs (DAGs) in a Markov equivalence class learned from observational data, we have to collect further information on causal structures from experiments with external interventions. In this paper, we propose an active learning approach for discovering causal structures in which we first find a Markov equivalence class from observational data, and then we orient undirected edges in every chain component via intervention experiments separately. In the experiments, some variables are manipulated through external interventions. We discuss two kinds of intervention experiments, randomized experiment and quasi-experiment. Furthermore, we give two optimal designs of experiments, a batch-intervention design and a sequential-intervention design, to minimize the number of manipulated variables and the set of candidate structures based on the minimax and the maximum entropy criteria. We show theoretically that structural learning can be done locally in subgraphs of chain components without need of checking illegal v-structures and cycles in the whole network and that a Markov equivalence subclass obtained after each intervention can still be depicted as a chain graph. Keywords: active learning, causal networks, directed acyclic graphs, intervention, Markov equivalence class, optimal design, structural learning
7 0.064475693 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning
8 0.059857339 52 jmlr-2008-Learning from Multiple Sources
9 0.052246675 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
10 0.05011956 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes
11 0.049487699 90 jmlr-2008-Theoretical Advantages of Lenient Learners: An Evolutionary Game Theoretic Perspective
12 0.043970611 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
13 0.041325606 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management
14 0.041038547 92 jmlr-2008-Universal Multi-Task Kernels
15 0.037832242 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses
16 0.037383031 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
17 0.034922607 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
18 0.034573719 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
19 0.034416571 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
20 0.034291521 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
topicId topicWeight
[(0, 0.212), (1, -0.019), (2, -0.117), (3, -0.133), (4, -0.186), (5, -0.387), (6, -0.143), (7, -0.14), (8, 0.1), (9, 0.025), (10, 0.079), (11, -0.112), (12, 0.151), (13, 0.039), (14, -0.134), (15, 0.063), (16, -0.187), (17, -0.129), (18, -0.111), (19, -0.047), (20, -0.01), (21, -0.055), (22, 0.116), (23, -0.023), (24, -0.156), (25, 0.147), (26, 0.062), (27, -0.182), (28, -0.072), (29, 0.034), (30, 0.096), (31, 0.107), (32, -0.061), (33, 0.082), (34, 0.051), (35, 0.0), (36, -0.004), (37, -0.007), (38, -0.088), (39, 0.034), (40, 0.123), (41, -0.013), (42, 0.066), (43, 0.078), (44, -0.035), (45, -0.043), (46, -0.051), (47, -0.072), (48, 0.046), (49, 0.054)]
simIndex simValue paperId paperTitle
same-paper 1 0.93127137 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
Author: Rémi Munos, Csaba Szepesvári
Abstract: In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrary good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted L p -norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is “aligned” with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings. Keywords: fitted value iteration, discounted Markovian decision processes, generative model, reinforcement learning, supervised learning, regression, Pollard’s inequality, statistical learning theory, optimal control
2 0.78689563 95 jmlr-2008-Value Function Based Reinforcement Learning in Changing Markovian Environments
Author: Balázs Csanád Csáji, László Monostori
Abstract: The paper investigates the possibility of applying value function based reinforcement learning (RL) methods in cases when the environment may change over time. First, theorems are presented which show that the optimal value function of a discounted Markov decision process (MDP) Lipschitz continuously depends on the immediate-cost function and the transition-probability function. Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of (ε, δ)-MDPs is introduced, which is a generalization of MDPs and ε-MDPs. In this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must be bounded in the limit. Then, learning algorithms in changing environments are analyzed. A general relaxed convergence theorem for stochastic iterative algorithms is presented. We also demonstrate the results through three classical RL methods: asynchronous value iteration, Q-learning and temporal difference learning. Finally, some numerical experiments concerning changing environments are presented. Keywords: Markov decision processes, reinforcement learning, changing environments, (ε, δ)MDPs, value function bounds, stochastic iterative algorithms
3 0.51255435 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
Author: Andreas Maurer
Abstract: A method is introduced to learn and represent similarity with linear operators in kernel induced Hilbert spaces. Transferring error bounds for vector valued large-margin classifiers to the setting of Hilbert-Schmidt operators leads to dimension free bounds on a risk functional for linear representations and motivates a regularized objective functional. Minimization of this objective is effected by a simple technique of stochastic gradient descent. The resulting representations are tested on transfer problems in image processing, involving plane and spatial geometric invariants, handwritten characters and face recognition. Keywords: learning similarity, similarity, transfer learning
Author: Avraham Bab, Ronen I. Brafman
Abstract: Multi Agent Reinforcement Learning (MARL) has received continually growing attention in the past decade. Many algorithms that vary in their approaches to the different subtasks of MARL have been developed. However, the theoretical convergence results for these algorithms do not give a clue as to their practical performance nor supply insights to the dynamics of the learning process itself. This work is a comprehensive empirical study conducted on MGS, a simulation system developed for this purpose. It surveys the important algorithms in the field, demonstrates the strengths and weaknesses of the different approaches to MARL through application of FriendQ, OAL, WoLF, FoeQ, Rmax, and other algorithms to a variety of fully cooperative and fully competitive domains in self and heterogeneous play, and supplies an informal analysis of the resulting learning processes. The results can aid in the design of new learning algorithms, in matching existing algorithms to specific tasks, and may guide further research and formal analysis of the learning processes. Keywords: reinforcement learning, multi-agent reinforcement learning, stochastic games
Author: Imhoi Koo, Rhee Man Kil
Abstract: This paper presents a new method of model selection for regression problems using the modulus of continuity. For this purpose, we suggest the prediction risk bounds of regression models using the modulus of continuity which can be interpreted as the complexity of functions. We also present the model selection criterion referred to as the modulus of continuity information criterion (MCIC) which is derived from the suggested prediction risk bounds. The suggested MCIC provides a risk estimate using the modulus of continuity for a trained regression model (or an estimation function) while other model selection criteria such as the AIC and BIC use structural information such as the number of training parameters. As a result, the suggested MCIC is able to discriminate the performances of trained regression models, even with the same structure of training models. To show the effectiveness of the proposed method, the simulation for function approximation using the multilayer perceptrons (MLPs) was conducted. Through the simulation for function approximation, it was demonstrated that the suggested MCIC provides a good selection tool for nonlinear regression models, even with the limited size of data. Keywords: regression models, multilayer perceptrons, model selection, information criteria, modulus of continuity
7 0.26603881 52 jmlr-2008-Learning from Multiple Sources
8 0.25750279 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning
9 0.20371583 40 jmlr-2008-Graphical Methods for Efficient Likelihood Inference in Gaussian Covariance Models
10 0.18991482 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
11 0.17153391 64 jmlr-2008-Model Selection in Kernel Based Regression using the Influence Function (Special Topic on Model Selection)
12 0.16926113 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management
13 0.16201283 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
14 0.15844885 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
15 0.1547261 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
16 0.1528503 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
17 0.14192501 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
18 0.13976866 56 jmlr-2008-Magic Moments for Structured Output Prediction
19 0.13912939 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses
20 0.13656156 19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression
topicId topicWeight
[(0, 0.044), (5, 0.019), (31, 0.01), (33, 0.3), (40, 0.031), (51, 0.014), (54, 0.149), (58, 0.04), (66, 0.056), (76, 0.021), (78, 0.018), (88, 0.083), (92, 0.061), (94, 0.041), (99, 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.76725352 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
Author: Rémi Munos, Csaba Szepesvári
Abstract: In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrary good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted L p -norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is “aligned” with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings. Keywords: fitted value iteration, discounted Markovian decision processes, generative model, reinforcement learning, supervised learning, regression, Pollard’s inequality, statistical learning theory, optimal control
2 0.55879486 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
Author: Vojtěch Franc, Bogdan Savchynskyy
Abstract: The max-sum classifier predicts n-tuple of labels from n-tuple of observable variables by maximizing a sum of quality functions defined over neighbouring pairs of labels and observable variables. Predicting labels as MAP assignments of a Random Markov Field is a particular example of the max-sum classifier. Learning parameters of the max-sum classifier is a challenging problem because even computing the response of such classifier is NP-complete in general. Estimating parameters using the Maximum Likelihood approach is feasible only for a subclass of max-sum classifiers with an acyclic structure of neighbouring pairs. Recently, the discriminative methods represented by the perceptron and the Support Vector Machines, originally designed for binary linear classifiers, have been extended for learning some subclasses of the max-sum classifier. Besides the max-sum classifiers with the acyclic neighbouring structure, it has been shown that the discriminative learning is possible even with arbitrary neighbouring structure provided the quality functions fulfill some additional constraints. In this article, we extend the discriminative approach to other three classes of max-sum classifiers with an arbitrary neighbourhood structure. We derive learning algorithms for two subclasses of max-sum classifiers whose response can be computed in polynomial time: (i) the max-sum classifiers with supermodular quality functions and (ii) the max-sum classifiers whose response can be computed exactly by a linear programming relaxation. Moreover, we show that the learning problem can be approximately solved even for a general max-sum classifier. Keywords: max-xum classifier, hidden Markov networks, support vector machines
3 0.54295313 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting
Author: David Mease, Abraham Wyner
Abstract: The statistical perspective on boosting algorithms focuses on optimization, drawing parallels with maximum likelihood estimation for logistic regression. In this paper we present empirical evidence that raises questions about this view. Although the statistical perspective provides a theoretical framework within which it is possible to derive theorems and create new algorithms in general contexts, we show that there remain many unanswered important questions. Furthermore, we provide examples that reveal crucial flaws in the many practical suggestions and new methods that are derived from the statistical view. We perform carefully designed experiments using simple simulation models to illustrate some of these flaws and their practical consequences. Keywords: boosting algorithms, LogitBoost, AdaBoost
4 0.53646082 95 jmlr-2008-Value Function Based Reinforcement Learning in Changing Markovian Environments
Author: Balázs Csanád Csáji, László Monostori
Abstract: The paper investigates the possibility of applying value function based reinforcement learning (RL) methods in cases when the environment may change over time. First, theorems are presented which show that the optimal value function of a discounted Markov decision process (MDP) Lipschitz continuously depends on the immediate-cost function and the transition-probability function. Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of (ε, δ)-MDPs is introduced, which is a generalization of MDPs and ε-MDPs. In this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must be bounded in the limit. Then, learning algorithms in changing environments are analyzed. A general relaxed convergence theorem for stochastic iterative algorithms is presented. We also demonstrate the results through three classical RL methods: asynchronous value iteration, Q-learning and temporal difference learning. Finally, some numerical experiments concerning changing environments are presented. Keywords: Markov decision processes, reinforcement learning, changing environments, (ε, δ)MDPs, value function bounds, stochastic iterative algorithms
5 0.47223109 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
Author: Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, Peter L. Bartlett
Abstract: Log-linear and maximum-margin models are two commonly-used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or maxmargin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O( 1 ) EG ε updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log( 1 )) updates are required. For both the max-margin and log-linear cases, our bounds suggest ε that the online EG algorithm requires a factor of n less computation to reach a desired accuracy than the batch EG algorithm, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing. We perform extensive evaluation of the algorithms, comparing them to L-BFGS and stochastic gradient descent for log-linear models, and to SVM-Struct for max-margin models. The algorithms are applied to a multi-class problem as well as to a more complex large-scale parsing task. In all these settings, the EG algorithms presented here outperform the other methods. Keywords: exponentiated gradient, log-linear models, maximum-margin models, structured prediction, conditional random fields ∗. These authors contributed equally. c 2008 Michael Col
6 0.47204679 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
7 0.45243916 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
8 0.45243818 19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression
9 0.44660935 56 jmlr-2008-Magic Moments for Structured Output Prediction
10 0.44361642 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
11 0.44291002 83 jmlr-2008-Robust Submodular Observation Selection
12 0.44210342 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
13 0.44183704 86 jmlr-2008-SimpleMKL
14 0.43881965 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
15 0.43803728 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
16 0.43585789 9 jmlr-2008-Active Learning by Spherical Subdivision
17 0.43544805 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
18 0.43410218 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management
19 0.4325527 58 jmlr-2008-Max-margin Classification of Data with Absent Features