nips nips2009 nips2009-16 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Vijay Desai, Vivek Farias, Ciamac C. Moallemi
Abstract: We present a novel linear program for the approximation of the dynamic programming cost-to-go function in high-dimensional stochastic control problems. LP approaches to approximate DP naturally restrict attention to approximations that are lower bounds to the optimal cost-to-go function. Our program – the ‘smoothed approximate linear program’ – relaxes this restriction in an appropriate fashion while remaining computationally tractable. Doing so appears to have several advantages: First, we demonstrate superior bounds on the quality of approximation to the optimal cost-to-go function afforded by our approach. Second, experiments with our approach on a challenging problem (the game of Tetris) show that the approach outperforms the existing LP approach (which has previously been shown to be competitive with several ADP algorithms) by an order of magnitude. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract We present a novel linear program for the approximation of the dynamic programming cost-to-go function in high-dimensional stochastic control problems. [sent-8, score-0.173]
2 LP approaches to approximate DP naturally restrict attention to approximations that are lower bounds to the optimal cost-to-go function. [sent-9, score-0.118]
3 Our program – the ‘smoothed approximate linear program’ – relaxes this restriction in an appropriate fashion while remaining computationally tractable. [sent-10, score-0.148]
4 Doing so appears to have several advantages: First, we demonstrate superior bounds on the quality of approximation to the optimal cost-to-go function afforded by our approach. [sent-11, score-0.117]
5 1 Introduction Many dynamic optimization problems can be cast as Markov decision problems (MDPs) and solved, in principle, via dynamic programming. [sent-13, score-0.076]
6 ADP algorithms seek to compute good approximations to the dynamic programing optimal cost-to-go function within the span of some pre-specified set of basis functions. [sent-16, score-0.174]
7 The program employed in the ALP approach is identical to the LP used for exact computation of the optimal cost-to-go function, with further constraints limiting solutions to the low-dimensional subspace spanned by the basis functions used. [sent-18, score-0.197]
8 The resulting low dimensional LP implicitly restricts attention to approximations that are lower bounds on the optimal cost-to-go function. [sent-19, score-0.101]
9 While the structure of this program appears crucial in establishing approximation guarantees for the approach, the restriction to lower bounds leads one to ask whether the ALP is the ‘right’ LP. [sent-20, score-0.239]
10 In particular, could an appropriate relaxation of the feasible region of the ALP allow for better approximations to the cost-to-go function while remaining computationally tractable? [sent-21, score-0.132]
11 Motivated by this question, the present paper presents a new linear program for ADP we call the ‘smoothed’ ALP (or SALP). [sent-22, score-0.081]
12 The SALP may be viewed as a relaxation of the ALP wherein one is allowed to violate the ALP constraints for any given state. [sent-23, score-0.063]
13 A user defined ‘violation budget’ parameter controls the ‘expected’ violation across states; a budget of 0 thus yields the original ALP. [sent-24, score-0.269]
14 We specify a choice of this violation budget that yields a relaxation with attractive properties. [sent-25, score-0.318]
15 In particular, we are able to establish strong approximation guarantees for the SALP; these guarantees are substantially stronger than the corresponding guarantees for the ALP. [sent-26, score-0.156]
16 The number of constraints and variables in the SALP scale with the size of the MDP state space. [sent-27, score-0.069]
17 We nonetheless establish sample complexity bounds that demonstrate that an 1 appropriate ‘sampled’ SALP provides a good approximation to the SALP solution with a tractable number of sampled MDP states. [sent-28, score-0.246]
18 This sampled program is no more complex than the ‘sampled’ ALP and, as such, we demonstrate that the SALP is essentially no harder to solve than the ALP. [sent-29, score-0.153]
19 In detailed comparisons with the ALP, we estimate that the SALP provides an order of magnitude improvement over controllers designed via that approach for the game of Tetris. [sent-32, score-0.06]
20 2 Problem Formulation Our setting is that of a discrete-time, discounted infinite-horizon, cost-minimizing MDP with a finite state space X and finite action space A. [sent-33, score-0.073]
21 Given the state and action at time t, xt and at , a per-stage cost g(xt , at ) is incurred. [sent-34, score-0.076]
22 A stationary policy µ : X → A is a mapping that determines the action at each time as a function of the state. [sent-36, score-0.099]
23 Given each initial state x0 = x, the expected discounted cost (cost-to-go function) of the policy µ is given by Jµ (x) Eµ ∞ αt g(xt , µ(xt )) x0 = x . [sent-37, score-0.144]
24 Denote by Pµ ∈ RX ×X the transition probability matrix for the policy µ, whose (x, x )th entry is Pµ(x) (x, x ). [sent-39, score-0.085]
25 Then, the cost-to-go function Jµ is the unique solution to the equation Tµ J = J, where the operator Tµ is defined by Tµ J = gµ + αPµ J. [sent-41, score-0.059]
26 It is readily shown that the optimal cost-to-go function J ∗ is the unique solution to Bellman’s equation and that a corresponding optimal policy µ∗ is greedy with respect to J ∗ ; i. [sent-44, score-0.179]
27 The above program is indeed an LP since the constraint J(x) ≤ (T J)(x) is equivalent to the set of linear constraints J(x) ≤ g(x, a) + α x ∈X Pa (x, x )J(x ), ∀ a ∈ A. [sent-49, score-0.137]
28 Then, every feasible point for (1) is a component-wise lower bound to J ∗ , and J ∗ is the unique optimal solution to the exact LP (1). [sent-52, score-0.138]
29 For problems where X is prohibitively large, an ADP algorithm seeks to find a good approximation to J ∗ . [sent-53, score-0.061]
30 Specifically, one considers a collection of basis functions {φ1 , . [sent-54, score-0.058]
31 φK ] to be a matrix with columns consisting of basis functions, one seeks an approximation of the form Jr = Φr, with the hope that Jr ∼ J ∗ . [sent-61, score-0.106]
32 In other words, the approximate cost-to-go function is necessarily a point-wise lower bound to the true cost-to-go function in the span of Φ. [sent-67, score-0.083]
33 Figure 1: A cartoon illustrating the feasible set and optimal solution for the ALP and SALP, in the case of a two-state MDP. [sent-70, score-0.126]
34 A careful relaxation from the feasible set of the ALP to that of the SALP can yield an improved approximation. [sent-72, score-0.066]
35 3 The Smoothed ALP The J ≤ T J constraints in the exact LP, which carry over to the ALP, impose a strong restriction on the cost-to-go function approximation: in particular they restrict us to approximations that are lower bounds to J ∗ at every point in the state space. [sent-73, score-0.16]
36 In the case where the state space is very large, and the number of basis functions is (relatively) small, it may be the case that constraints arising from rarely visited or pathological states are binding and influence the optimal solution. [sent-74, score-0.174]
37 In many cases, our ultimate goal is not to find a lower bound on the optimal cost-to-go function, but rather a good approximation to J ∗ . [sent-75, score-0.109]
38 In these instances, it may be the case that relaxing the constraints in the ALP so as not to require a uniform lower bound may allow for better overall approximations to the optimal cost-to-go function. [sent-76, score-0.147]
39 Relaxing the feasible region of the ALP in Figure 1(b) to the light gray region in Figure 1(b) would yield the point ΦrSALP as an optimal solution. [sent-78, score-0.1]
40 The relaxation in this case is clearly beneficial; it allows us to compute a better approximation to J ∗ than the point ΦrSALP . [sent-79, score-0.079]
41 The smoothed approximate linear program (SALP) is given by: (3) maximize r,s ν Φr subject to Φr ≤ T Φr + s, π s ≤ θ, s ≥ 0. [sent-81, score-0.179]
42 For each state x, s(x) is a non-negative decision variable (a slack) that allows for violation of the corresponding ALP constraint. [sent-83, score-0.223]
43 The parameter π ∈ RX is a probability distribution known as the constraint violation distribution. [sent-85, score-0.192]
44 The parameter θ is thus a violation budget: the expected violation of the Φr ≤ T Φr constraint, under the distribution π, must be less than θ. [sent-86, score-0.334]
45 Our analysis will present two types of results: First, we prove approximation guarantees (Sections 4. [sent-94, score-0.078]
46 2) that will indicate that the SALP computes approximations that are of comparable quality to the projection of J ∗ on the linear span of Φ. [sent-96, score-0.075]
47 3) that an implementable ‘sampled’ version of the SALP may be used to approximate the SALP with a tractable number of samples. [sent-98, score-0.058]
48 In the case of the ALP, one either assumes the ability to solve a linear program with as many constraints as there are states, or absent that, knowledge of a certain idealized sampling distribution, so that one can then proceed with solving a ‘sampled’ version of the ALP. [sent-101, score-0.17]
49 Our analysis of the SALP in this section is predicated on the knowledge of an idealized constraint violation distribution, which is this same idealized sampling distribution. [sent-102, score-0.308]
50 The distribution πµ∗ ,ν may be interpreted as yielding the discounted expected frequency of visits to a given state when the initial state is distributed according to ν and the system runs under the optimal policy µ∗ . [sent-105, score-0.208]
51 We note that the ‘sampled’ ALP introduced by de Farias and Van Roy [2] requires access to states sampled according to precisely this distribution. [sent-106, score-0.109]
52 1 A Simple Approximation Guarantee We present a first, simple approximation guarantee for the following specialization of the SALP in (3): maximize r,s (4) ν Φr subject to Φr ≤ T Φr + s, πµ∗ ,ν s ≤ θ, s ≥ 0. [sent-108, score-0.129]
53 Before we proceed to state our result, we define a useful function: (r, θ) (5) minimize γ s,γ subject to Φr − T Φr ≤ s + γ1, πµ∗ ,ν s ≤ θ, s ≥ 0. [sent-109, score-0.062]
54 Armed with this definition, we are now in a position to state our first, crude approximation guarantee: 4 Theorem 1. [sent-116, score-0.085]
55 1−α ,θ)+2θ as the approximation error associated The above theorem allows us to interpret (r 1−α with the SALP solution r. [sent-121, score-0.098]
56 This is precisely Theorem 2 in de Farias and Van Roy [1]; we recover their approximation guarantee for the ALP. [sent-124, score-0.097]
57 In the event that Φr∗ (x) − T Φr∗ (x) is large for only a small number of states (that is, the Bellman error of the approximation produced by r∗ is large for only a small number of states), we thus expect to have a choice of θ for which l(r∗ , θ) + 2θ l(r∗ , 0). [sent-126, score-0.098]
58 Thus, Theorem 1 reinforces the intuition (shown via Figure 1) that the SALP will permit closer approximations to J ∗ than the ALP. [sent-127, score-0.059]
59 Since it is unlikely that the basis functions Φ will provide a uniformly good approximation over the entire state space, the right hand side of our bound could be quite large. [sent-130, score-0.168]
60 While we do not show this here, this choice allows us to choose regions of the state space where we would like a better approximation of J ∗ . [sent-133, score-0.102]
61 Our guarantee does not suggest a concrete choice of the violation budget, θ. [sent-136, score-0.217]
62 2 A Better Approximation Guarantee With the intent of deriving stronger approximation guarantees, we begin this section by introducing a ‘nicer’ measure of the quality of approximation afforded by Φ. [sent-139, score-0.107]
63 The weighting function ψ allows us to weight approximation error in a non-uniform fashion across the state space and in this manner potentially ignore approximation quality in regions of the state space that ‘don’t matter’. [sent-141, score-0.218]
64 In addition to specifying the constraint violation distribution π as we did for our previous bound, we will specify (implicitly) a particular choice of the violation budget θ. [sent-142, score-0.478]
65 For every ψ ∈ Ψ, let β(ψ) = maxµ Then, for an optimal solution (rSALP , s) to (6), we have: ¯ J ∗ − ΦrSALP 1,ν ≤ inf r,ψ∈Ψ J ∗ − Φr ∞,1/ψ 5 ν ψ+ 2(πµ∗ ,ν ψ + 1)(αβ(ψ) + 1) 1−α Pµ ψ ψ . [sent-147, score-0.071]
66 Thus, the approximation guarantee of Theorem 2 allows us to view the SALP as automating the critical procedure of identifying a good Lyapunov function for a given problem. [sent-154, score-0.08]
67 3 Sample Complexity Our analysis thus far has assumed we have the ability to solve the SALP, a program with a potentially intractable number of constraints and variables. [sent-156, score-0.112]
68 As it turns out, a solution to the SALP is well approximated by the solution to a certain ‘sampled’ program which we now ˆ describe: Let X = {x1 , x2 , . [sent-157, score-0.139]
69 Let us consider solving the following program which we call the sampled SALP: maximize r,s (7) ν Φr − 2 (1−α)S ˆ x∈X s(x) ˆ subject to (Φr)(x) ≤ (T Φr)(x) + s(x), ∀ x ∈ X , r ∈ N , s ≥ 0. [sent-161, score-0.188]
70 Notice that (7) is a linear program with S variables and S|A| constraints. [sent-163, score-0.081]
71 Under the conditions of Theorem 2, let rSALP be an optimal solution to the SALP (6), and let rSALP be an optimal solution to the sampled SALP (7). [sent-169, score-0.168]
72 Further, given ∈ (0, B] and δ ∈ (0, 1/2], suppose that the number of sampled states S satisfies 16eB 8 64B 2 2(K + 2) log + log . [sent-171, score-0.092]
73 1−α Theorem 3 establishes that the sampled SALP provides a close approximation to the solution of the SALP, in the sense that the approximation guarantees we established for the SALP are approximately valid for the solution to the sampled version with high probability. [sent-173, score-0.313]
74 This number depends linearly on the number of basis functions and the diameter of the 1 This includes those ψ that do not satisfy the Lyapunov condition β(ψ) ≤ 1/α. [sent-175, score-0.061]
75 6 feasible region, but is otherwise independent of the size of the state space for the MDP under consideration. [sent-176, score-0.072]
76 Exploiting the fact that the ALP has a small number of variables, de Farias and Van Roy [2] establish a sample complexity bound for a sampled version of the ALP analogous (7). [sent-179, score-0.13]
77 The number of samples required for this sampled ALP to produce a good approximation to the ALP can be shown to depend on the same problem parameters we have identified here, viz. [sent-180, score-0.105]
78 The sample complexity in that case is identical to the sample complexity bound established here up to constants and an additional multiplicative factor of B/ (for the sampled SALP). [sent-182, score-0.16]
79 Thus, the two sample complexity bounds are within polynomial terms of each other and we have established that the SALP is essentially no harder to solve than the ALP. [sent-183, score-0.091]
80 This section places the SALP on solid theoretical ground by establishing strong approximation guarantees for the SALP that represent a substantial improvement over those available for the ALP and sample complexity results that indicated that the SALP was implementable via sampling. [sent-184, score-0.17]
81 Tetris represents precisely the kind of large and unstructured MDP for which it is difficult to design heuristic controllers, and hence policies designed by ADP algorithms are particularly relevant. [sent-188, score-0.057]
82 Given a sample size S, a collection X ⊂ X of S states was sampled. [sent-194, score-0.063]
83 Given the collection X of sampled states, an increasing sequence of choices of the violation budget θ ≥ 0 is considered. [sent-197, score-0.34]
84 For each choice of θ, the optimization program maximize r,s (8) 1 S ˆ x∈X (Φr)(x) ˆ subject to Φr(x) ≤ T Φr(x) + s(x), ∀ x ∈ X , 1 ˆ s(x) ≤ θ, x∈X S ˆ s(x) ≥ 0, ∀ x ∈ X, was solved. [sent-198, score-0.147]
85 This program is a version of the original SALP (3), but with sampled empirical distributions in place of the state-relevance weights ν and the constraint violation distribution π. [sent-199, score-0.331]
86 Given a vector of weights obtained by solving (8), the performance of the corresponding policy is evaluated via Monte Carlo simulation over 3, 000 games of Tetris. [sent-203, score-0.085]
87 For each pair (S, θ), the resulting average performance (averaged over 10 different sets of sampled states) is shown in Figure 2. [sent-205, score-0.058]
88 It provides experimental evidence for the intuition expressed in Section 3 and the analytic result of Theorem 1: Relaxing the constraints of the ALP by allowing for a violation budget allows for better policy performance. [sent-206, score-0.4]
89 As the violation budget θ is increased from 0, performance dramatically improves. [sent-207, score-0.269]
90 2 Our baseline policy had an average performance of 113 points. [sent-210, score-0.085]
91 00256 θ = 0 (ALP) 4 2 0 50 100 150 200 250 300 ×103 Sample Size S Figure 2: Average performance of SALP for different values of the number of sampled states S and the violation budget θ. [sent-216, score-0.361]
92 Algorithm Best Performance CPU Time ALP TD-Learning [7] ALP with bootstrapping [3] TD-Learning [8] Policy gradient [9] SALP 897 3,183 4,274 4,471 5,500 10,775 hours minutes hours minutes days hours Table 1: Comparison of the performance of the best policy found with various ADP methods. [sent-221, score-0.142]
93 Note that significantly better policies are possible with this basis function architecture than any of the ADP algorithms in Table 1 discover. [sent-222, score-0.085]
94 In addition, the approach employs a number of rather arbitrary Tetris specific ‘modifications’ that are ultimately seen to be critical to performance - in the absence of these modifications, the method is unable to find a policy for Tetris that scores above a few hundred points. [sent-225, score-0.085]
95 2 are approximation guarantees, which provide bounds on the approximation error given by the SALP approach versus the best approximation possible with the particular set of basis functions. [sent-229, score-0.217]
96 These provide bounds on the performance of the resulting SALP policies, as a function of the basis architecture. [sent-231, score-0.076]
97 Rather than solving a large linear program, such an algorithm would optimize a policy in an online fashion along a single system trajectory. [sent-233, score-0.105]
98 However, a sample path SALP variation would inherit all of the theoretical bounds developed here. [sent-235, score-0.059]
99 On constraint sampling in the linear programming approach to approximate dynamic programming. [sent-247, score-0.087]
100 Temporal differences–based policy iteration and applications in neuro–dynamic programming. [sent-275, score-0.085]
wordName wordTfidf (topN-words)
[('salp', 0.754), ('alp', 0.444), ('tetris', 0.192), ('adp', 0.168), ('violation', 0.167), ('rsalp', 0.123), ('farias', 0.121), ('budget', 0.102), ('policy', 0.085), ('program', 0.081), ('ralp', 0.069), ('bellman', 0.06), ('sampled', 0.058), ('idealized', 0.058), ('lp', 0.055), ('roy', 0.053), ('approximation', 0.047), ('basis', 0.045), ('mdp', 0.044), ('rx', 0.043), ('lyapunov', 0.041), ('policies', 0.04), ('state', 0.038), ('van', 0.036), ('feasible', 0.034), ('states', 0.034), ('guarantee', 0.033), ('di', 0.032), ('relaxation', 0.032), ('approximations', 0.032), ('smoothed', 0.032), ('bounds', 0.031), ('constraints', 0.031), ('guarantees', 0.031), ('span', 0.03), ('solution', 0.029), ('dynamic', 0.029), ('ciamac', 0.027), ('orded', 0.027), ('optimal', 0.026), ('constraint', 0.025), ('maximize', 0.025), ('xt', 0.024), ('subject', 0.024), ('szita', 0.024), ('bound', 0.024), ('columbia', 0.023), ('game', 0.022), ('relaxing', 0.022), ('implementable', 0.022), ('theorem', 0.022), ('establishing', 0.021), ('discounted', 0.021), ('cartoon', 0.021), ('controllers', 0.021), ('jr', 0.021), ('fashion', 0.02), ('region', 0.02), ('bertsekas', 0.019), ('hours', 0.019), ('tractable', 0.019), ('decision', 0.018), ('choice', 0.017), ('erent', 0.017), ('operator', 0.017), ('illustrated', 0.017), ('improvement', 0.017), ('approximate', 0.017), ('precisely', 0.017), ('sample', 0.016), ('complexity', 0.016), ('diameter', 0.016), ('programming', 0.016), ('inf', 0.016), ('restriction', 0.016), ('illustrating', 0.016), ('establish', 0.016), ('weighting', 0.015), ('intuition', 0.015), ('appropriate', 0.014), ('harder', 0.014), ('spanned', 0.014), ('established', 0.014), ('action', 0.014), ('side', 0.014), ('seeks', 0.014), ('unique', 0.013), ('quality', 0.013), ('collection', 0.013), ('cpu', 0.012), ('lower', 0.012), ('argmaxx', 0.012), ('assurance', 0.012), ('cleared', 0.012), ('demaine', 0.012), ('inherit', 0.012), ('pat', 0.012), ('programing', 0.012), ('queueing', 0.012), ('reinforces', 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 16 nips-2009-A Smoothed Approximate Linear Program
Author: Vijay Desai, Vivek Farias, Ciamac C. Moallemi
Abstract: We present a novel linear program for the approximation of the dynamic programming cost-to-go function in high-dimensional stochastic control problems. LP approaches to approximate DP naturally restrict attention to approximations that are lower bounds to the optimal cost-to-go function. Our program – the ‘smoothed approximate linear program’ – relaxes this restriction in an appropriate fashion while remaining computationally tractable. Doing so appears to have several advantages: First, we demonstrate superior bounds on the quality of approximation to the optimal cost-to-go function afforded by our approach. Second, experiments with our approach on a challenging problem (the game of Tetris) show that the approach outperforms the existing LP approach (which has previously been shown to be competitive with several ADP algorithms) by an order of magnitude. 1
2 0.28650439 209 nips-2009-Robust Value Function Approximation Using Bilinear Programming
Author: Marek Petrik, Shlomo Zilberstein
Abstract: Existing value function approximation methods have been successfully used in many applications, but they often lack useful a priori error bounds. We propose approximate bilinear programming, a new formulation of value function approximation that provides strong a priori guarantees. In particular, this approach provably finds an approximate value function that minimizes the Bellman residual. Solving a bilinear program optimally is NP-hard, but this is unavoidable because the Bellman-residual minimization itself is NP-hard. We therefore employ and analyze a common approximate algorithm for bilinear programs. The analysis shows that this algorithm offers a convergent generalization of approximate policy iteration. Finally, we demonstrate that the proposed approach can consistently minimize the Bellman residual on a simple benchmark problem. 1 Motivation Solving large Markov Decision Problems (MDPs) is a very useful, but computationally challenging problem addressed widely in the AI literature, particularly in the area of reinforcement learning. It is widely accepted that large MDPs can only be solved approximately. The commonly used approximation methods can be divided into three broad categories: 1) policy search, which explores a restricted space of all policies, 2) approximate dynamic programming, which searches a restricted space of value functions, and 3) approximate linear programming, which approximates the solution using a linear program. While all of these methods have achieved impressive results in many domains, they have significant limitations. Policy search methods rely on local search in a restricted policy space. The policy may be represented, for example, as a finite-state controller [22] or as a greedy policy with respect to an approximate value function [24]. Policy search methods have achieved impressive results in such domains as Tetris [24] and helicopter control [1]. However, they are notoriously hard to analyze. We are not aware of any theoretical guarantees regarding the quality of the solution. Approximate dynamic programming (ADP) methods iteratively approximate the value function [4, 20, 23]. They have been extensively analyzed and are the most commonly used methods. However, ADP methods typically do not converge and they only provide weak guarantees of approximation quality. The approximation error bounds are usually expressed in terms of the worst-case approximation of the value function over all policies [4]. In addition, most available bounds are with respect to the L∞ norm, while the algorithms often minimize the L2 norm. While there exist some L2 -based bounds [14], they require values that are difficult to obtain. Approximate linear programming (ALP) uses a linear program to compute the approximate value function in a particular vector space [7]. ALP has been previously used in a wide variety of settings [2, 9, 10]. Although ALP often does not perform as well as ADP, there have been some recent 1 efforts to close the gap [18]. ALP has better theoretical properties than ADP and policy search. It is guaranteed to converge and return the closest L1 -norm approximation v of the optimal value func˜ tion v ∗ up to a multiplicative factor. However, the L1 norm must be properly weighted to guarantee a small policy loss, and there is no reliable method for selecting appropriate weights [7]. To summarize, the existing reinforcement learning techniques often provide good solutions, but typically require significant domain knowledge [20]. The domain knowledge is needed partly because useful a priori error bounds are not available, as mentioned above. Our goal is to develop a more robust method that is guaranteed to minimize an actual bound on the policy loss. We present a new formulation of value function approximation that provably minimizes a bound on the policy loss. Unlike in some other algorithms, the bound in this case does not rely on values that are hard to obtain. The new method unifies policy search and value-function search methods to minimize the L∞ norm of the Bellman residual, which bounds the policy loss. We start with a description of the framework and notation in Section 2. Then, in Section 3, we describe the proposed Approximate Bilinear Programming (ABP) formulation. A drawback of this formulation is its computational complexity, which may be exponential. We show in Section 4 that this is unavoidable, because minimizing the approximation error bound is in fact NP-hard. Although our focus is on the formulation and its properties, we also discuss some simple algorithms for solving bilinear programs. Section 5 shows that ABP can be seen as an improvement of ALP and Approximate Policy Iteration (API). Section 6 demonstrates the applicability of ABP using a common reinforcement learning benchmark problem. A complete discussion of sampling strategies–an essential component for achieving robustness–is beyond the scope of this paper, but the issue is briefly discussed in Section 6. Complete proofs of the theorems can be found in [19]. 2 Solving MDPs using ALP In this section, we formally define MDPs, their ALP formulation, and the approximation errors involved. These notions serve as a basis for developing the ABP formulation. A Markov Decision Process is a tuple (S, A, P, r, α), where S is the finite set of states, A is the finite set of actions. P : S × S × A → [0, 1] is the transition function, where P (s , s, a) represents the probability of transiting to state s from state s, given action a. The function r : S × A → R is the reward function, and α : S → [0, 1] is the initial state distribution. The objective is to maximize the infinite-horizon discounted cumulative reward. To shorten the notation, we assume an arbitrary ordering of the states: s1 , s2 , . . . , sn . Then, Pa and ra are used to denote the probabilistic transition matrix and reward for action a. The solution of an MDP is a policy π : S × A → [0, 1] from a set of possible policies Π, such that for all s ∈ S, a∈A π(s, a) = 1. We assume that the policies may be stochastic, but stationary [21]. A policy is deterministic when π(s, a) ∈ {0, 1} for all s ∈ S and a ∈ A. The transition and reward functions for a given policy are denoted by Pπ and rπ . The value function update for a policy π is denoted by Lπ , and the Bellman operator is denoted by L. That is: Lπ v = Pπ v + rπ Lv = max Lπ v. π∈Π The optimal value function, denoted v ∗ , satisfies v ∗ = Lv ∗ . We focus on linear value function approximation for discounted infinite-horizon problems. In linear value function approximation, the value function is represented as a linear combination of nonlinear basis functions (vectors). For each state s, we define a row-vector φ(s) of features. The rows of the basis matrix M correspond to φ(s), and the approximation space is generated by the columns of the matrix. That is, the basis matrix M , and the value function v are represented as: − φ(s1 ) − M = − φ(s2 ) − v = M x. . . . Definition 1. A value function, v, is representable if v ∈ M ⊆ R|S| , where M = colspan (M ), and is transitive-feasible when v ≥ Lv. We denote the set of transitive-feasible value functions as: K = {v ∈ R|S| v ≥ Lv}. 2 Notice that the optimal value function v ∗ is transitive-feasible, and M is a linear space. Also, all the inequalities are element-wise. Because the new formulation is related to ALP, we introduce it first. It is well known that an infinite horizon discounted MDP problem may be formulated in terms of solving the following linear program: minimize v c(s)v(s) s∈S v(s) − γ s.t. P (s , s, a)v(s ) ≥ r(s, a) ∀(s, a) ∈ (S, A) (1) s ∈S We use A as a shorthand notation for the constraint matrix and b for the right-hand side. The value c represents a distribution over the states, usually a uniform one. That is, s∈S c(s) = 1. The linear program in Eq. (1) is often too large to be solved precisely, so it is approximated to get an approximate linear program by assuming that v ∈ M [8], as follows: minimize cT v x Av ≥ b s.t. (2) v∈M The constraint v ∈ M denotes the approximation. To actually solve this linear program, the value function is represented as v = M x. In the remainder of the paper, we assume that 1 ∈ M to guarantee the feasibility of the ALP, where 1 is a vector of all ones. The optimal solution of the ALP, v , satisfies that v ≥ v ∗ . Then, the objective of Eq. (2) represents the minimization of v − v ∗ 1,c , ˜ ˜ ˜ where · 1,c is a c-weighted L1 norm [7]. The ultimate goal of the optimization is not to obtain a good value function v , but a good policy. ˜ The quality of the policy, typically chosen to be greedy with respect to v , depends non-trivially on ˜ the approximate value function. The ABP formulation will minimize policy loss by minimizing L˜ − v ∞ , which bounds the policy loss as follows. v ˜ Theorem 2 (e.g. [25]). Let v be an arbitrary value function, and let v be the value of the greedy ˜ ˆ policy with respect to v . Then: ˜ 2 v∗ − v ∞ ≤ ˆ L˜ − v ∞ , v ˜ 1−γ In addition, if v ≥ L˜, the policy loss is smallest for the greedy policy. ˜ v Policies, like value functions, can be represented as vectors. Assume an arbitrary ordering of the state-action pairs, such that o(s, a) → N maps a state and an action to its position. The policies are represented as θ ∈ R|S|×|A| , and we use the shorthand notation θ(s, a) = θ(o(s, a)). Remark 3. The corresponding π and θ are denoted as π θ and θπ and satisfy: π θ (s, a) = θπ (s, a). We will also consider approximations of the policies in the policy-space, generated by columns of a matrix N . A policy is representable when π ∈ N , where N = colspan (N ). 3 Approximate Bilinear Programs This section shows how to formulate minv∈M Lv − v ∞ as a separable bilinear program. Bilinear programs are a generalization of linear programs with an additional bilinear term in the objective function. A separable bilinear program consists of two linear programs with independent constraints and are fairly easy to solve and analyze. Definition 4 (Separable Bilinear Program). A separable bilinear program in the normal form is defined as follows: T T minimize f (w, x, y, z) = sT w + r1 x + xT Cy + r2 y + sT z 1 2 w,x y,z s.t. A1 x + B1 w = b1 A2 y + B2 z = b2 w, x ≥ 0 y, z ≥ 0 3 (3) We separate the variables using a vertical line and the constraints using different columns to emphasize the separable nature of the bilinear program. In this paper, we only use separable bilinear programs and refer to them simply as bilinear programs. An approximate bilinear program can now be formulated as follows. minimize θT λ + λ θ λ,λ ,v Bθ = 1 z = Av − b s.t. θ≥0 z≥0 (4) λ+λ1≥z λ≥0 θ∈N v∈M All variables are vectors except λ , which is a scalar. The symbol z is only used to simplify the notation and does not need to represent an optimization variable. The variable v is defined for each state and represents the value function. Matrix A represents constraints that are identical to the constraints in Eq. (2). The variables λ correspond to all state-action pairs. These variables represent the Bellman residuals that are being minimized. The variables θ are defined for all state-action pairs and represent policies in Remark 3. The matrix B represents the following constraints: θ(s, a) = 1 ∀s ∈ S. a∈A As with approximate linear programs, we initially assume that all the constraints on z are used. In realistic settings, however, the constraints would be sampled or somehow reduced. We defer the discussion of this issue until Section 6. Note that the constraints in our formulation correspond to elements of z and θ. Thus when constraints are omitted, also the corresponding elements of z and θ are omitted. To simplify the notation, the value function approximation in this problem is denoted only implicitly by v ∈ M, and the policy approximation is denoted by θ ∈ N . In an actual implementation, the optimization variables would be x, y using the relationships v = M x and θ = N y. We do not assume any approximation of the policy space, unless mentioned otherwise. We also use v or θ to refer to partial solutions of Eq. (4) with the other variables chosen appropriately to achieve feasibility. The ABP formulation is closely related to approximate linear programs, and we discuss the connection in Section 5. We first analyze the properties of the optimal solutions of the bilinear program and then show and discuss the solution methods in Section 4. The following theorem states the main property of the bilinear formulation. ˜˜ ˜ ˜ Theorem 5. b Let (θ, v , λ, λ ) be an optimal solution of Eq. (4) and assume that 1 ∈ M. Then: ˜ ˜ ˜ θT λ + λ = L˜ − v v ˜ ∞ ≤ min v∈K∩M Lv − v ∞ ≤ 2 min Lv − v v∈M ∞ ≤ 2(1 + γ) min v − v ∗ v∈M ∞. ˜ In addition, π θ minimizes the Bellman residual with regard to v , and its value function v satisfies: ˜ ˆ 2 min Lv − v ∞ . v − v∗ ∞ ≤ ˆ 1 − γ v∈M The proof of the theorem can be found in [19]. It is important to note that, as Theorem 5 states, the ABP approach is equivalent to a minimization over all representable value functions, not only the transitive-feasible ones. Notice also the missing coefficient 2 (2 instead of 4) in the last equation of Theorem 5. This follows by subtracting a constant vector 1 from v to balance the lower bounds ˜ on the Bellman residual error with the upper ones. This modified approximate value function will have 1/2 of the original Bellman residual but an identical greedy policy. Finally, note that whenever v ∗ ∈ M, both ABP and ALP will return the optimal value function. The ABP solution minimizes the L∞ norm of the Bellman residual due to: 1) the correspondence between θ and the policies, and 2) the dual representation with respect to variables λ and λ . The theorem then follows using techniques similar to those used for approximate linear programs [7]. 4 Algorithm 1: Iterative algorithm for solving Eq. (3) (x0 , w0 ) ← random ; (y0 , z0 ) ← arg miny,z f (w0 , x0 , y, z) ; i←1; while yi−1 = yi or xi−1 = xi do (yi , zi ) ← arg min{y,z A2 y+B2 z=b2 y,z≥0} f (wi−1 , xi−1 , y, z) ; (xi , wi ) ← arg min{x,w A1 x+B1 w=b1 x,w≥0} f (w, x, yi , zi ) ; i←i+1 return f (wi , xi , yi , zi ) 4 Solving Bilinear Programs In this section we describe simple methods for solving ABPs. We first describe optimal methods, which have exponential complexity, and then discuss some approximation strategies. Solving a bilinear program is an NP-complete problem [3]. The membership in NP follows from the finite number of basic feasible solutions of the individual linear programs, each of which can be checked in polynomial time. The NP-hardness is shown by a reduction from the SAT problem [3]. The NP-completeness of ABP compares unfavorably with the polynomial complexity of ALP. However, most other ADP algorithms are not guaranteed to converge to a solution in finite time. The following theorem shows that the computational complexity of the ABP formulation is asymptotically the same as the complexity of the problem it solves. Theorem 6. b Determining minv∈K∩M Lv − v ∞ < is NP-complete for the full constraint representation, 0 < γ < 1, and a given > 0. In addition, the problem remains NP-complete when 1 ∈ M, and therefore minv∈M Lv − v ∞ < is also NP-complete. As the theorem states, the value function approximation does not become computationally simpler even when 1 ∈ M – a universal assumption in the paper. Notice that ALP can determine whether minv∈K∩M Lv − v ∞ = 0 in polynomial time. The proof of Theorem 6 is based on a reduction from SAT and can be found in [19]. The policy in the reduction determines the true literal in each clause, and the approximate value function corresponds to the truth value of the literals. The approximation basis forces literals that share the same variable to have consistent values. Bilinear programs are non-convex and are typically solved using global optimization techniques. The common solution methods are based on concave cuts [11] or branch-and-bound [6]. In ABP settings with a small number of features, the successive approximation algorithm [17] may be applied efficiently. We are, however, not aware of commercial solvers available for solving bilinear programs. Bilinear programs can be formulated as concave quadratic minimization problems [11], or mixed integer linear programs [11, 16], for which there are numerous commercial solvers available. Because we are interested in solving very large bilinear programs, we describe simple approximate algorithms next. Optimal scalable methods are beyond the scope of this paper. The most common approximate method for solving bilinear programs is shown in Algorithm 1. It is designed for the general formulation shown in Eq. (3), where f (w, x, y, z) represents the objective function. The minimizations in the algorithm are linear programs which can be easily solved. Interestingly, as we will show in Section 5, Algorithm 1 applied to ABP generalizes a version of API. While Algorithm 1 is not guaranteed to find an optimal solution, its empirical performance is often remarkably good [13]. Its basic properties are summarized by the following proposition. Proposition 7 (e.g. [3]). Algorithm 1 is guaranteed to converge, assuming that the linear program solutions are in a vertex of the optimality simplex. In addition, the global optimum is a fixed point of the algorithm, and the objective value monotonically improves during execution. 5 The proof is based on the finite count of the basic feasible solutions of the individual linear programs. Because the objective function does not increase in any iteration, the algorithm will eventually converge. In the context of MDPs, Algorithm 1 can be further refined. For example, the constraint v ∈ M in Eq. (4) serves mostly to simplify the bilinear program and a value function that violates it may still be acceptable. The following proposition motivates the construction of a new value function from two transitive-feasible value functions. Proposition 8. Let v1 and v2 be feasible value functions in Eq. (4). Then the value function ˜ ˜ v (s) = min{˜1 (s), v2 (s)} is also feasible in Eq. (4). Therefore v ≥ v ∗ and v ∗ − v ∞ ≤ ˜ v ˜ ˜ ˜ min { v ∗ − v1 ∞ , v ∗ − v2 ∞ }. ˜ ˜ The proof of the proposition is based on Jensen’s inequality and can be found in [19]. Proposition 8 can be used to extend Algorithm 1 when solving ABPs. One option is to take the state-wise minimum of values from multiple random executions of Algorithm 1, which preserves the transitive feasibility of the value function. However, the increasing number of value functions used to obtain v also increases the potential sampling error. ˜ 5 Relationship to ALP and API In this section, we describe the important connections between ABP and the two closely related ADP methods: ALP, and API with L∞ minimization. Both of these methods are commonly used, for example to solve factored MDPs [10]. Our analysis sheds light on some of their observed properties and leads to a new convergent form of API. ABP addresses some important issues with ALP: 1) ALP provides value function bounds with respect to L1 norm, which does not guarantee small policy loss, 2) ALP’s solution quality depends significantly on the heuristically-chosen objective function c in Eq. (2) [7], and 3) incomplete constraint samples in ALP easily lead to unbounded linear programs. The drawback of using ABP, however, is the higher computational complexity. Both the first and the second issues in ALP can be addressed by choosing the right objective function [7]. Because this objective function depends on the optimal ALP solution, it cannot be practically computed. Instead, various heuristics are usually used. The heuristic objective functions may lead to significant improvements in specific domains, but they do not provide any guarantees. ABP, on the other hand, has no such parameters that require adjustments. The third issue arises when the constraints of an ALP need to be sampled in some large domains. The ALP may become unbounded with incomplete samples because its objective value is defined using the L1 norm on the states, and the constraints are defined using the L∞ norm of the Bellman residual. In ABP, the Bellman residual is used in both the constraints and objective function. The objective function of ABP is then bounded below by 0 for an arbitrarily small number of samples. ABP can also improve on API with L∞ minimization (L∞ -API for short), which is a leading method for solving factored MDPs [10]. Minimizing the L∞ approximation error is theoretically preferable, since it is compatible with the existing bounds on policy loss [10]. In contrast, few practical bounds exist for API with the L2 norm minimization [14], such as LSPI [12]. L∞ -API is shown in Algorithm 2, where f (π) is calculated using the following program: minimize φ φ,v s.t. (I − γPπ )v + 1φ ≥ rπ −(I − γPπ )v + 1φ ≥ −rπ (5) v∈M Here I denotes the identity matrix. We are not aware of a convergence or a divergence proof of L∞ -API, and this analysis is beyond the scope of this paper. 6 Algorithm 2: Approximate policy iteration, where f (π) denotes a custom value function approximation for the policy π. π0 , k ← rand, 1 ; while πk = πk−1 do vk ← f (πk−1 ) ; ˜ πk (s) ← arg maxa∈A r(s, a) + γ s ∈S P (s , s, a)˜k (s) ∀s ∈ S ; v k ←k+1 We propose Optimistic Approximate Policy Iteration (OAPI), a modification of API. OAPI is shown in Algorithm 2, where f (π) is calculated using the following program: minimize φ φ,v s.t. Av ≥ b (≡ (I − γPπ )v ≥ rπ ∀π ∈ Π) −(I − γPπ )v + 1φ ≥ −rπ (6) v∈M In fact, OAPI corresponds to Algorithm 1 applied to ABP because Eq. (6) corresponds to Eq. (4) with fixed θ. Then, using Proposition 7, we get the following corollary. Corollary 9. Optimistic approximate policy iteration converges in finite time. In addition, the Bellman residual of the generated value functions monotonically decreases. OAPI differs from L∞ -API in two ways: 1) OAPI constrains the Bellman residuals by 0 from below and by φ from above, and then it minimizes φ. L∞ -API constrains the Bellman residuals by φ from both above and below. 2) OAPI, like API, uses only the current policy for the upper bound on the Bellman residual, but uses all the policies for the lower bound on the Bellman residual. L∞ -API cannot return an approximate value function that has a lower Bellman residual than ABP, given the optimality of ABP described in Theorem 5. However, even OAPI, an approximate ABP algorithm, performs comparably to L∞ -API, as the following theorem states. Theorem 10. b Assume that L∞ -API converges to a policy π and a value function v that both φ satisfy: φ = v − Lπ v ∞ = v − Lv ∞ . Then v = v + 1−γ 1 is feasible in Eq. (4), and it is a fixed ˜ point of OAPI. In addition, the greedy policies with respect to v and v are identical. ˜ The proof is based on two facts. First, v is feasible with respect to the constraints in Eq. (4). The ˜ Bellman residual changes for all the policies identically, since a constant vector is added. Second, because Lπ is greedy with respect to v , we have that v ≥ Lπ v ≥ L˜. The value function v is ˜ ˜ ˜ v ˜ therefore transitive-feasible. The full proof can be found in [19]. To summarize, OAPI guarantees convergence, while matching the performance of L∞ -API. The convergence of OAPI is achieved because given a non-negative Bellman residual, the greedy policy also minimizes the Bellman residual. Because OAPI ensures that the Bellman residual is always non-negative, it can progressively reduce it. In comparison, the greedy policy in L∞ -API does not minimize the Bellman residual, and therefore L∞ -API does not always reduce it. Theorem 10 also explains why API provides better solutions than ALP, as observed in [10]. From the discussion above, ALP can be seen as an L1 -norm approximation of a single iteration of OAPI. L∞ -API, on the other hand, performs many such ALP-like iterations. 6 Empirical Evaluation As we showed in Theorem 10, even OAPI, the very simple approximate algorithm for ABP, can perform as well as existing state-of-the art methods on factored MDPs. However, a deeper understanding of the formulation and potential solution methods will be necessary in order to determine the full practical impact of the proposed methods. In this section, we validate the approach by applying it to the mountain car problem, a simple reinforcement learning benchmark problem. We have so far considered that all the constraints involving z are present in the ABP in Eq. (4). Because the constraints correspond to all state-action pairs, it is often impractical to even enumerate 7 (a) L∞ error of the Bellman residual Features 100 144 OAPI 0.21 (0.23) 0.13 (0.1) ALP 13. (13.) 3.6 (4.3) LSPI 9. (14.) 3.9 (7.7) API 0.46 (0.08) 0.86 (1.18) (b) L2 error of the Bellman residual Features 100 144 OAPI 0.2 (0.3) 0.1 (1.9) ALP 9.5 (18.) 0.3 (0.4) LSPI 1.2 (1.5) 0.9 (0.1) API 0.04 (0.01) 0.08 (0.08) Table 1: Bellman residual of the final value function. The values are averages over 5 executions, with the standard deviations shown in parentheses. them. This issue can be addressed in at least two ways. First, a small randomly-selected subset of the constraints can be used in the ABP, a common approach in ALP [9, 5]. The ALP sampling bounds can be easily extended to ABP. Second, the structure of the MDP can be used to reduce the number of constraints. Such a reduction is possible, for example, in factored MDPs with L∞ -API and ALP [10], and can be easily extended to OAPI and ABP. In the mountain-car benchmark, an underpowered car needs to climb a hill [23]. To do so, it first needs to back up to an opposite hill to gain sufficient momentum. The car receives a reward of 1 when it climbs the hill. In the experiments we used a discount factor γ = 0.99. The experiments are designed to determine whether OAPI reliably minimizes the Bellman residual in comparison with API and ALP. We use a uniformly-spaced linear spline to approximate the value function. The constraints were based on 200 uniformly sampled states with all 3 actions per state. We evaluated the methods with the number of the approximation features 100 and 144, which corresponds to the number of linear segments. The results of ABP (in particular OAPI), ALP, API with L2 minimization, and LSPI are depicted in Table 1. The results are shown for both L∞ norm and uniformly-weighted L2 norm. The runtimes of all these methods are comparable, with ALP being the fastest. Since API (LSPI) is not guaranteed to converge, we ran it for at most 20 iterations, which was an upper bound on the number of iterations of OAPI. The results demonstrate that ABP minimizes the L∞ Bellman residual much more consistently than the other methods. Note, however, that all the considered algorithms would perform significantly better given a finer approximation. 7 Conclusion and Future Work We proposed and analyzed approximate bilinear programming, a new value-function approximation method, which provably minimizes the L∞ Bellman residual. ABP returns the optimal approximate value function with respect to the Bellman residual bounds, despite the formulation with regard to transitive-feasible value functions. We also showed that there is no asymptotically simpler formulation, since finding the closest value function and solving a bilinear program are both NP-complete problems. Finally, the formulation leads to the development of OAPI, a new convergent form of API which monotonically improves the objective value function. While we only discussed approximate solutions of the ABP, a deeper study of bilinear solvers may render optimal solution methods feasible. ABPs have a small number of essential variables (that determine the value function) and a large number of constraints, which can be leveraged by the solvers [15]. The L∞ error bound provides good theoretical guarantees, but it may be too conservative in practice. A similar formulation based on L2 norm minimization may be more practical. We believe that the proposed formulation will help to deepen the understanding of value function approximation and the characteristics of existing solution methods, and potentially lead to the development of more robust and widely-applicable reinforcement learning algorithms. Acknowledgements This work was supported by the Air Force Office of Scientific Research under Grant No. FA955008-1-0171. We also thank the anonymous reviewers for their useful comments. 8 References [1] Pieter Abbeel, Varun Ganapathi, and Andrew Y. Ng. Learning vehicular dynamics, with application to modeling helicopters. In Advances in Neural Information Processing Systems, pages 1–8, 2006. [2] Daniel Adelman. A price-directed approach to stochastic inventory/routing. Operations Research, 52:499–514, 2004. [3] Kristin P. Bennett and O. L. Mangasarian. Bilinear separation of two sets in n-space. Technical report, Computer Science Department, University of Wisconsin, 1992. [4] Dimitri P. Bertsekas and Sergey Ioffe. Temporal differences-based policy iteration and applications in neuro-dynamic programming. Technical Report LIDS-P-2349, LIDS, 1997. [5] Guiuseppe Calafiore and M.C. Campi. Uncertain convex programs: Randomized solutions and confidence levels. Mathematical Programming, Series A, 102:25–46, 2005. [6] Alberto Carpara and Michele Monaci. Bidimensional packing by bilinear programming. Mathematical Programming Series A, 118:75–108, 2009. [7] Daniela P. de Farias. The Linear Programming Approach to Approximate Dynamic Programming: Theory and Application. PhD thesis, Stanford University, 2002. [8] Daniela P. de Farias and Ben Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51:850–856, 2003. [9] Daniela Pucci de Farias and Benjamin Van Roy. On constraint sampling in the linear programming approach to approximate dynamic programming. Mathematics of Operations Research, 29(3):462–478, 2004. [10] Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman. Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research, 19:399–468, 2003. [11] Reiner Horst and Hoang Tuy. Global optimization: Deterministic approaches. Springer, 1996. [12] Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003. [13] O. L. Mangasarian. The linear complementarity problem as a separable bilinear program. Journal of Global Optimization, 12:1–7, 1995. [14] Remi Munos. Error bounds for approximate policy iteration. In International Conference on Machine Learning, pages 560–567, 2003. [15] Marek Petrik and Shlomo Zilberstein. Anytime coordination using separable bilinear programs. In Conference on Artificial Intelligence, pages 750–755, 2007. [16] Marek Petrik and Shlomo Zilberstein. Average reward decentralized Markov decision processes. In International Joint Conference on Artificial Intelligence, pages 1997–2002, 2007. [17] Marek Petrik and Shlomo Zilberstein. A bilinear programming approach for multiagent planning. Journal of Artificial Intelligence Research, 35:235–274, 2009. [18] Marek Petrik and Shlomo Zilberstein. Constraint relaxation in approximate linear programs. In International Conference on Machine Learning, pages 809–816, 2009. [19] Marek Petrik and Shlomo Zilberstein. Robust value function approximation using bilinear programming. Technical Report UM-CS-2009-052, Department of Computer Science, University of Massachusetts Amherst, 2009. [20] Warren B. Powell. Approximate Dynamic Programming. Wiley-Interscience, 2007. [21] Martin L. Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, Inc., 2005. [22] Kenneth O. Stanley and Risto Miikkulainen. Competitive coevolution through evolutionary complexification. Journal of Artificial Intelligence Research, 21:63–100, 2004. [23] Richard S. Sutton and Andrew Barto. Reinforcement learning. MIT Press, 1998. [24] Istvan Szita and Andras Lorincz. Learning Tetris using the noisy cross-entropy method. Neural Computation, 18(12):2936–2941, 2006. [25] Ronald J. Williams and Leemon C. Baird. Tight performance bounds on greedy policies based on imperfect value functions. In Yale Workshop on Adaptive and Learning Systems, 1994. 9
3 0.060344931 113 nips-2009-Improving Existing Fault Recovery Policies
Author: Guy Shani, Christopher Meek
Abstract: An automated recovery system is a key component in a large data center. Such a system typically employs a hand-made controller created by an expert. While such controllers capture many important aspects of the recovery process, they are often not systematically optimized to reduce costs such as server downtime. In this paper we describe a passive policy learning approach for improving existing recovery policies without exploration. We explain how to use data gathered from the interactions of the hand-made controller with the system, to create an improved controller. We suggest learning an indefinite horizon Partially Observable Markov Decision Process, a model for decision making under uncertainty, and solve it using a point-based algorithm. We describe the complete process, starting with data gathering, model learning, model checking procedures, and computing a policy. 1
4 0.057424493 221 nips-2009-Solving Stochastic Games
Author: Liam M. Dermed, Charles L. Isbell
Abstract: Solving multi-agent reinforcement learning problems has proven difficult because of the lack of tractable algorithms. We provide the first approximation algorithm which solves stochastic games with cheap-talk to within absolute error of the optimal game-theoretic solution, in time polynomial in 1/ . Our algorithm extends Murray’s and Gordon’s (2007) modified Bellman equation which determines the set of all possible achievable utilities; this provides us a truly general framework for multi-agent learning. Further, we empirically validate our algorithm and find the computational cost to be orders of magnitude less than what the theory predicts. 1
5 0.044966903 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation
Author: Shalabh Bhatnagar, Doina Precup, David Silver, Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári
Abstract: We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD(λ), Q-learning and Sarsa have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al. (2009a, 2009b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman error, and algorithms that perform stochastic gradient-descent on this function. These methods can be viewed as natural generalizations to previous TD methods, as they converge to the same limit points when used with linear function approximation methods. We generalize this work to nonlinear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms, for any finite Markov decision process and any smooth value function approximator, to a locally optimal solution. The algorithms are incremental and the computational complexity per time step scales linearly with the number of parameters of the approximator. Empirical results obtained in the game of Go demonstrate the algorithms’ effectiveness. 1
6 0.044627089 31 nips-2009-An LP View of the M-best MAP problem
7 0.043645076 111 nips-2009-Hierarchical Modeling of Local Image Features through $L p$-Nested Symmetric Distributions
8 0.043520905 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability
9 0.041125406 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
10 0.035101578 12 nips-2009-A Generalized Natural Actor-Critic Algorithm
11 0.031584334 208 nips-2009-Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Matrices via Convex Optimization
12 0.031527523 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers
13 0.030911632 242 nips-2009-The Infinite Partially Observable Markov Decision Process
14 0.028671073 11 nips-2009-A General Projection Property for Distribution Families
15 0.028593795 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs
16 0.028454637 54 nips-2009-Compositionality of optimal control laws
17 0.02627537 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization
18 0.026217015 35 nips-2009-Approximating MAP by Compensating for Structural Relaxations
19 0.026166361 53 nips-2009-Complexity of Decentralized Control: Special Cases
20 0.025482694 7 nips-2009-A Data-Driven Approach to Modeling Choice
topicId topicWeight
[(0, -0.088), (1, 0.052), (2, 0.064), (3, -0.07), (4, -0.106), (5, 0.008), (6, 0.034), (7, -0.023), (8, 0.02), (9, 0.001), (10, -0.019), (11, 0.023), (12, -0.014), (13, 0.04), (14, -0.019), (15, -0.051), (16, -0.017), (17, -0.034), (18, 0.007), (19, -0.007), (20, 0.017), (21, -0.13), (22, 0.169), (23, 0.004), (24, -0.252), (25, -0.011), (26, -0.184), (27, 0.037), (28, -0.023), (29, 0.035), (30, -0.009), (31, -0.01), (32, -0.057), (33, -0.042), (34, 0.049), (35, 0.054), (36, 0.027), (37, 0.076), (38, 0.018), (39, -0.114), (40, -0.065), (41, 0.002), (42, -0.109), (43, 0.009), (44, 0.206), (45, -0.226), (46, 0.048), (47, -0.008), (48, -0.047), (49, 0.158)]
simIndex simValue paperId paperTitle
same-paper 1 0.92373765 16 nips-2009-A Smoothed Approximate Linear Program
Author: Vijay Desai, Vivek Farias, Ciamac C. Moallemi
Abstract: We present a novel linear program for the approximation of the dynamic programming cost-to-go function in high-dimensional stochastic control problems. LP approaches to approximate DP naturally restrict attention to approximations that are lower bounds to the optimal cost-to-go function. Our program – the ‘smoothed approximate linear program’ – relaxes this restriction in an appropriate fashion while remaining computationally tractable. Doing so appears to have several advantages: First, we demonstrate superior bounds on the quality of approximation to the optimal cost-to-go function afforded by our approach. Second, experiments with our approach on a challenging problem (the game of Tetris) show that the approach outperforms the existing LP approach (which has previously been shown to be competitive with several ADP algorithms) by an order of magnitude. 1
2 0.85850954 209 nips-2009-Robust Value Function Approximation Using Bilinear Programming
Author: Marek Petrik, Shlomo Zilberstein
Abstract: Existing value function approximation methods have been successfully used in many applications, but they often lack useful a priori error bounds. We propose approximate bilinear programming, a new formulation of value function approximation that provides strong a priori guarantees. In particular, this approach provably finds an approximate value function that minimizes the Bellman residual. Solving a bilinear program optimally is NP-hard, but this is unavoidable because the Bellman-residual minimization itself is NP-hard. We therefore employ and analyze a common approximate algorithm for bilinear programs. The analysis shows that this algorithm offers a convergent generalization of approximate policy iteration. Finally, we demonstrate that the proposed approach can consistently minimize the Bellman residual on a simple benchmark problem. 1 Motivation Solving large Markov Decision Problems (MDPs) is a very useful, but computationally challenging problem addressed widely in the AI literature, particularly in the area of reinforcement learning. It is widely accepted that large MDPs can only be solved approximately. The commonly used approximation methods can be divided into three broad categories: 1) policy search, which explores a restricted space of all policies, 2) approximate dynamic programming, which searches a restricted space of value functions, and 3) approximate linear programming, which approximates the solution using a linear program. While all of these methods have achieved impressive results in many domains, they have significant limitations. Policy search methods rely on local search in a restricted policy space. The policy may be represented, for example, as a finite-state controller [22] or as a greedy policy with respect to an approximate value function [24]. Policy search methods have achieved impressive results in such domains as Tetris [24] and helicopter control [1]. However, they are notoriously hard to analyze. We are not aware of any theoretical guarantees regarding the quality of the solution. Approximate dynamic programming (ADP) methods iteratively approximate the value function [4, 20, 23]. They have been extensively analyzed and are the most commonly used methods. However, ADP methods typically do not converge and they only provide weak guarantees of approximation quality. The approximation error bounds are usually expressed in terms of the worst-case approximation of the value function over all policies [4]. In addition, most available bounds are with respect to the L∞ norm, while the algorithms often minimize the L2 norm. While there exist some L2 -based bounds [14], they require values that are difficult to obtain. Approximate linear programming (ALP) uses a linear program to compute the approximate value function in a particular vector space [7]. ALP has been previously used in a wide variety of settings [2, 9, 10]. Although ALP often does not perform as well as ADP, there have been some recent 1 efforts to close the gap [18]. ALP has better theoretical properties than ADP and policy search. It is guaranteed to converge and return the closest L1 -norm approximation v of the optimal value func˜ tion v ∗ up to a multiplicative factor. However, the L1 norm must be properly weighted to guarantee a small policy loss, and there is no reliable method for selecting appropriate weights [7]. To summarize, the existing reinforcement learning techniques often provide good solutions, but typically require significant domain knowledge [20]. The domain knowledge is needed partly because useful a priori error bounds are not available, as mentioned above. Our goal is to develop a more robust method that is guaranteed to minimize an actual bound on the policy loss. We present a new formulation of value function approximation that provably minimizes a bound on the policy loss. Unlike in some other algorithms, the bound in this case does not rely on values that are hard to obtain. The new method unifies policy search and value-function search methods to minimize the L∞ norm of the Bellman residual, which bounds the policy loss. We start with a description of the framework and notation in Section 2. Then, in Section 3, we describe the proposed Approximate Bilinear Programming (ABP) formulation. A drawback of this formulation is its computational complexity, which may be exponential. We show in Section 4 that this is unavoidable, because minimizing the approximation error bound is in fact NP-hard. Although our focus is on the formulation and its properties, we also discuss some simple algorithms for solving bilinear programs. Section 5 shows that ABP can be seen as an improvement of ALP and Approximate Policy Iteration (API). Section 6 demonstrates the applicability of ABP using a common reinforcement learning benchmark problem. A complete discussion of sampling strategies–an essential component for achieving robustness–is beyond the scope of this paper, but the issue is briefly discussed in Section 6. Complete proofs of the theorems can be found in [19]. 2 Solving MDPs using ALP In this section, we formally define MDPs, their ALP formulation, and the approximation errors involved. These notions serve as a basis for developing the ABP formulation. A Markov Decision Process is a tuple (S, A, P, r, α), where S is the finite set of states, A is the finite set of actions. P : S × S × A → [0, 1] is the transition function, where P (s , s, a) represents the probability of transiting to state s from state s, given action a. The function r : S × A → R is the reward function, and α : S → [0, 1] is the initial state distribution. The objective is to maximize the infinite-horizon discounted cumulative reward. To shorten the notation, we assume an arbitrary ordering of the states: s1 , s2 , . . . , sn . Then, Pa and ra are used to denote the probabilistic transition matrix and reward for action a. The solution of an MDP is a policy π : S × A → [0, 1] from a set of possible policies Π, such that for all s ∈ S, a∈A π(s, a) = 1. We assume that the policies may be stochastic, but stationary [21]. A policy is deterministic when π(s, a) ∈ {0, 1} for all s ∈ S and a ∈ A. The transition and reward functions for a given policy are denoted by Pπ and rπ . The value function update for a policy π is denoted by Lπ , and the Bellman operator is denoted by L. That is: Lπ v = Pπ v + rπ Lv = max Lπ v. π∈Π The optimal value function, denoted v ∗ , satisfies v ∗ = Lv ∗ . We focus on linear value function approximation for discounted infinite-horizon problems. In linear value function approximation, the value function is represented as a linear combination of nonlinear basis functions (vectors). For each state s, we define a row-vector φ(s) of features. The rows of the basis matrix M correspond to φ(s), and the approximation space is generated by the columns of the matrix. That is, the basis matrix M , and the value function v are represented as: − φ(s1 ) − M = − φ(s2 ) − v = M x. . . . Definition 1. A value function, v, is representable if v ∈ M ⊆ R|S| , where M = colspan (M ), and is transitive-feasible when v ≥ Lv. We denote the set of transitive-feasible value functions as: K = {v ∈ R|S| v ≥ Lv}. 2 Notice that the optimal value function v ∗ is transitive-feasible, and M is a linear space. Also, all the inequalities are element-wise. Because the new formulation is related to ALP, we introduce it first. It is well known that an infinite horizon discounted MDP problem may be formulated in terms of solving the following linear program: minimize v c(s)v(s) s∈S v(s) − γ s.t. P (s , s, a)v(s ) ≥ r(s, a) ∀(s, a) ∈ (S, A) (1) s ∈S We use A as a shorthand notation for the constraint matrix and b for the right-hand side. The value c represents a distribution over the states, usually a uniform one. That is, s∈S c(s) = 1. The linear program in Eq. (1) is often too large to be solved precisely, so it is approximated to get an approximate linear program by assuming that v ∈ M [8], as follows: minimize cT v x Av ≥ b s.t. (2) v∈M The constraint v ∈ M denotes the approximation. To actually solve this linear program, the value function is represented as v = M x. In the remainder of the paper, we assume that 1 ∈ M to guarantee the feasibility of the ALP, where 1 is a vector of all ones. The optimal solution of the ALP, v , satisfies that v ≥ v ∗ . Then, the objective of Eq. (2) represents the minimization of v − v ∗ 1,c , ˜ ˜ ˜ where · 1,c is a c-weighted L1 norm [7]. The ultimate goal of the optimization is not to obtain a good value function v , but a good policy. ˜ The quality of the policy, typically chosen to be greedy with respect to v , depends non-trivially on ˜ the approximate value function. The ABP formulation will minimize policy loss by minimizing L˜ − v ∞ , which bounds the policy loss as follows. v ˜ Theorem 2 (e.g. [25]). Let v be an arbitrary value function, and let v be the value of the greedy ˜ ˆ policy with respect to v . Then: ˜ 2 v∗ − v ∞ ≤ ˆ L˜ − v ∞ , v ˜ 1−γ In addition, if v ≥ L˜, the policy loss is smallest for the greedy policy. ˜ v Policies, like value functions, can be represented as vectors. Assume an arbitrary ordering of the state-action pairs, such that o(s, a) → N maps a state and an action to its position. The policies are represented as θ ∈ R|S|×|A| , and we use the shorthand notation θ(s, a) = θ(o(s, a)). Remark 3. The corresponding π and θ are denoted as π θ and θπ and satisfy: π θ (s, a) = θπ (s, a). We will also consider approximations of the policies in the policy-space, generated by columns of a matrix N . A policy is representable when π ∈ N , where N = colspan (N ). 3 Approximate Bilinear Programs This section shows how to formulate minv∈M Lv − v ∞ as a separable bilinear program. Bilinear programs are a generalization of linear programs with an additional bilinear term in the objective function. A separable bilinear program consists of two linear programs with independent constraints and are fairly easy to solve and analyze. Definition 4 (Separable Bilinear Program). A separable bilinear program in the normal form is defined as follows: T T minimize f (w, x, y, z) = sT w + r1 x + xT Cy + r2 y + sT z 1 2 w,x y,z s.t. A1 x + B1 w = b1 A2 y + B2 z = b2 w, x ≥ 0 y, z ≥ 0 3 (3) We separate the variables using a vertical line and the constraints using different columns to emphasize the separable nature of the bilinear program. In this paper, we only use separable bilinear programs and refer to them simply as bilinear programs. An approximate bilinear program can now be formulated as follows. minimize θT λ + λ θ λ,λ ,v Bθ = 1 z = Av − b s.t. θ≥0 z≥0 (4) λ+λ1≥z λ≥0 θ∈N v∈M All variables are vectors except λ , which is a scalar. The symbol z is only used to simplify the notation and does not need to represent an optimization variable. The variable v is defined for each state and represents the value function. Matrix A represents constraints that are identical to the constraints in Eq. (2). The variables λ correspond to all state-action pairs. These variables represent the Bellman residuals that are being minimized. The variables θ are defined for all state-action pairs and represent policies in Remark 3. The matrix B represents the following constraints: θ(s, a) = 1 ∀s ∈ S. a∈A As with approximate linear programs, we initially assume that all the constraints on z are used. In realistic settings, however, the constraints would be sampled or somehow reduced. We defer the discussion of this issue until Section 6. Note that the constraints in our formulation correspond to elements of z and θ. Thus when constraints are omitted, also the corresponding elements of z and θ are omitted. To simplify the notation, the value function approximation in this problem is denoted only implicitly by v ∈ M, and the policy approximation is denoted by θ ∈ N . In an actual implementation, the optimization variables would be x, y using the relationships v = M x and θ = N y. We do not assume any approximation of the policy space, unless mentioned otherwise. We also use v or θ to refer to partial solutions of Eq. (4) with the other variables chosen appropriately to achieve feasibility. The ABP formulation is closely related to approximate linear programs, and we discuss the connection in Section 5. We first analyze the properties of the optimal solutions of the bilinear program and then show and discuss the solution methods in Section 4. The following theorem states the main property of the bilinear formulation. ˜˜ ˜ ˜ Theorem 5. b Let (θ, v , λ, λ ) be an optimal solution of Eq. (4) and assume that 1 ∈ M. Then: ˜ ˜ ˜ θT λ + λ = L˜ − v v ˜ ∞ ≤ min v∈K∩M Lv − v ∞ ≤ 2 min Lv − v v∈M ∞ ≤ 2(1 + γ) min v − v ∗ v∈M ∞. ˜ In addition, π θ minimizes the Bellman residual with regard to v , and its value function v satisfies: ˜ ˆ 2 min Lv − v ∞ . v − v∗ ∞ ≤ ˆ 1 − γ v∈M The proof of the theorem can be found in [19]. It is important to note that, as Theorem 5 states, the ABP approach is equivalent to a minimization over all representable value functions, not only the transitive-feasible ones. Notice also the missing coefficient 2 (2 instead of 4) in the last equation of Theorem 5. This follows by subtracting a constant vector 1 from v to balance the lower bounds ˜ on the Bellman residual error with the upper ones. This modified approximate value function will have 1/2 of the original Bellman residual but an identical greedy policy. Finally, note that whenever v ∗ ∈ M, both ABP and ALP will return the optimal value function. The ABP solution minimizes the L∞ norm of the Bellman residual due to: 1) the correspondence between θ and the policies, and 2) the dual representation with respect to variables λ and λ . The theorem then follows using techniques similar to those used for approximate linear programs [7]. 4 Algorithm 1: Iterative algorithm for solving Eq. (3) (x0 , w0 ) ← random ; (y0 , z0 ) ← arg miny,z f (w0 , x0 , y, z) ; i←1; while yi−1 = yi or xi−1 = xi do (yi , zi ) ← arg min{y,z A2 y+B2 z=b2 y,z≥0} f (wi−1 , xi−1 , y, z) ; (xi , wi ) ← arg min{x,w A1 x+B1 w=b1 x,w≥0} f (w, x, yi , zi ) ; i←i+1 return f (wi , xi , yi , zi ) 4 Solving Bilinear Programs In this section we describe simple methods for solving ABPs. We first describe optimal methods, which have exponential complexity, and then discuss some approximation strategies. Solving a bilinear program is an NP-complete problem [3]. The membership in NP follows from the finite number of basic feasible solutions of the individual linear programs, each of which can be checked in polynomial time. The NP-hardness is shown by a reduction from the SAT problem [3]. The NP-completeness of ABP compares unfavorably with the polynomial complexity of ALP. However, most other ADP algorithms are not guaranteed to converge to a solution in finite time. The following theorem shows that the computational complexity of the ABP formulation is asymptotically the same as the complexity of the problem it solves. Theorem 6. b Determining minv∈K∩M Lv − v ∞ < is NP-complete for the full constraint representation, 0 < γ < 1, and a given > 0. In addition, the problem remains NP-complete when 1 ∈ M, and therefore minv∈M Lv − v ∞ < is also NP-complete. As the theorem states, the value function approximation does not become computationally simpler even when 1 ∈ M – a universal assumption in the paper. Notice that ALP can determine whether minv∈K∩M Lv − v ∞ = 0 in polynomial time. The proof of Theorem 6 is based on a reduction from SAT and can be found in [19]. The policy in the reduction determines the true literal in each clause, and the approximate value function corresponds to the truth value of the literals. The approximation basis forces literals that share the same variable to have consistent values. Bilinear programs are non-convex and are typically solved using global optimization techniques. The common solution methods are based on concave cuts [11] or branch-and-bound [6]. In ABP settings with a small number of features, the successive approximation algorithm [17] may be applied efficiently. We are, however, not aware of commercial solvers available for solving bilinear programs. Bilinear programs can be formulated as concave quadratic minimization problems [11], or mixed integer linear programs [11, 16], for which there are numerous commercial solvers available. Because we are interested in solving very large bilinear programs, we describe simple approximate algorithms next. Optimal scalable methods are beyond the scope of this paper. The most common approximate method for solving bilinear programs is shown in Algorithm 1. It is designed for the general formulation shown in Eq. (3), where f (w, x, y, z) represents the objective function. The minimizations in the algorithm are linear programs which can be easily solved. Interestingly, as we will show in Section 5, Algorithm 1 applied to ABP generalizes a version of API. While Algorithm 1 is not guaranteed to find an optimal solution, its empirical performance is often remarkably good [13]. Its basic properties are summarized by the following proposition. Proposition 7 (e.g. [3]). Algorithm 1 is guaranteed to converge, assuming that the linear program solutions are in a vertex of the optimality simplex. In addition, the global optimum is a fixed point of the algorithm, and the objective value monotonically improves during execution. 5 The proof is based on the finite count of the basic feasible solutions of the individual linear programs. Because the objective function does not increase in any iteration, the algorithm will eventually converge. In the context of MDPs, Algorithm 1 can be further refined. For example, the constraint v ∈ M in Eq. (4) serves mostly to simplify the bilinear program and a value function that violates it may still be acceptable. The following proposition motivates the construction of a new value function from two transitive-feasible value functions. Proposition 8. Let v1 and v2 be feasible value functions in Eq. (4). Then the value function ˜ ˜ v (s) = min{˜1 (s), v2 (s)} is also feasible in Eq. (4). Therefore v ≥ v ∗ and v ∗ − v ∞ ≤ ˜ v ˜ ˜ ˜ min { v ∗ − v1 ∞ , v ∗ − v2 ∞ }. ˜ ˜ The proof of the proposition is based on Jensen’s inequality and can be found in [19]. Proposition 8 can be used to extend Algorithm 1 when solving ABPs. One option is to take the state-wise minimum of values from multiple random executions of Algorithm 1, which preserves the transitive feasibility of the value function. However, the increasing number of value functions used to obtain v also increases the potential sampling error. ˜ 5 Relationship to ALP and API In this section, we describe the important connections between ABP and the two closely related ADP methods: ALP, and API with L∞ minimization. Both of these methods are commonly used, for example to solve factored MDPs [10]. Our analysis sheds light on some of their observed properties and leads to a new convergent form of API. ABP addresses some important issues with ALP: 1) ALP provides value function bounds with respect to L1 norm, which does not guarantee small policy loss, 2) ALP’s solution quality depends significantly on the heuristically-chosen objective function c in Eq. (2) [7], and 3) incomplete constraint samples in ALP easily lead to unbounded linear programs. The drawback of using ABP, however, is the higher computational complexity. Both the first and the second issues in ALP can be addressed by choosing the right objective function [7]. Because this objective function depends on the optimal ALP solution, it cannot be practically computed. Instead, various heuristics are usually used. The heuristic objective functions may lead to significant improvements in specific domains, but they do not provide any guarantees. ABP, on the other hand, has no such parameters that require adjustments. The third issue arises when the constraints of an ALP need to be sampled in some large domains. The ALP may become unbounded with incomplete samples because its objective value is defined using the L1 norm on the states, and the constraints are defined using the L∞ norm of the Bellman residual. In ABP, the Bellman residual is used in both the constraints and objective function. The objective function of ABP is then bounded below by 0 for an arbitrarily small number of samples. ABP can also improve on API with L∞ minimization (L∞ -API for short), which is a leading method for solving factored MDPs [10]. Minimizing the L∞ approximation error is theoretically preferable, since it is compatible with the existing bounds on policy loss [10]. In contrast, few practical bounds exist for API with the L2 norm minimization [14], such as LSPI [12]. L∞ -API is shown in Algorithm 2, where f (π) is calculated using the following program: minimize φ φ,v s.t. (I − γPπ )v + 1φ ≥ rπ −(I − γPπ )v + 1φ ≥ −rπ (5) v∈M Here I denotes the identity matrix. We are not aware of a convergence or a divergence proof of L∞ -API, and this analysis is beyond the scope of this paper. 6 Algorithm 2: Approximate policy iteration, where f (π) denotes a custom value function approximation for the policy π. π0 , k ← rand, 1 ; while πk = πk−1 do vk ← f (πk−1 ) ; ˜ πk (s) ← arg maxa∈A r(s, a) + γ s ∈S P (s , s, a)˜k (s) ∀s ∈ S ; v k ←k+1 We propose Optimistic Approximate Policy Iteration (OAPI), a modification of API. OAPI is shown in Algorithm 2, where f (π) is calculated using the following program: minimize φ φ,v s.t. Av ≥ b (≡ (I − γPπ )v ≥ rπ ∀π ∈ Π) −(I − γPπ )v + 1φ ≥ −rπ (6) v∈M In fact, OAPI corresponds to Algorithm 1 applied to ABP because Eq. (6) corresponds to Eq. (4) with fixed θ. Then, using Proposition 7, we get the following corollary. Corollary 9. Optimistic approximate policy iteration converges in finite time. In addition, the Bellman residual of the generated value functions monotonically decreases. OAPI differs from L∞ -API in two ways: 1) OAPI constrains the Bellman residuals by 0 from below and by φ from above, and then it minimizes φ. L∞ -API constrains the Bellman residuals by φ from both above and below. 2) OAPI, like API, uses only the current policy for the upper bound on the Bellman residual, but uses all the policies for the lower bound on the Bellman residual. L∞ -API cannot return an approximate value function that has a lower Bellman residual than ABP, given the optimality of ABP described in Theorem 5. However, even OAPI, an approximate ABP algorithm, performs comparably to L∞ -API, as the following theorem states. Theorem 10. b Assume that L∞ -API converges to a policy π and a value function v that both φ satisfy: φ = v − Lπ v ∞ = v − Lv ∞ . Then v = v + 1−γ 1 is feasible in Eq. (4), and it is a fixed ˜ point of OAPI. In addition, the greedy policies with respect to v and v are identical. ˜ The proof is based on two facts. First, v is feasible with respect to the constraints in Eq. (4). The ˜ Bellman residual changes for all the policies identically, since a constant vector is added. Second, because Lπ is greedy with respect to v , we have that v ≥ Lπ v ≥ L˜. The value function v is ˜ ˜ ˜ v ˜ therefore transitive-feasible. The full proof can be found in [19]. To summarize, OAPI guarantees convergence, while matching the performance of L∞ -API. The convergence of OAPI is achieved because given a non-negative Bellman residual, the greedy policy also minimizes the Bellman residual. Because OAPI ensures that the Bellman residual is always non-negative, it can progressively reduce it. In comparison, the greedy policy in L∞ -API does not minimize the Bellman residual, and therefore L∞ -API does not always reduce it. Theorem 10 also explains why API provides better solutions than ALP, as observed in [10]. From the discussion above, ALP can be seen as an L1 -norm approximation of a single iteration of OAPI. L∞ -API, on the other hand, performs many such ALP-like iterations. 6 Empirical Evaluation As we showed in Theorem 10, even OAPI, the very simple approximate algorithm for ABP, can perform as well as existing state-of-the art methods on factored MDPs. However, a deeper understanding of the formulation and potential solution methods will be necessary in order to determine the full practical impact of the proposed methods. In this section, we validate the approach by applying it to the mountain car problem, a simple reinforcement learning benchmark problem. We have so far considered that all the constraints involving z are present in the ABP in Eq. (4). Because the constraints correspond to all state-action pairs, it is often impractical to even enumerate 7 (a) L∞ error of the Bellman residual Features 100 144 OAPI 0.21 (0.23) 0.13 (0.1) ALP 13. (13.) 3.6 (4.3) LSPI 9. (14.) 3.9 (7.7) API 0.46 (0.08) 0.86 (1.18) (b) L2 error of the Bellman residual Features 100 144 OAPI 0.2 (0.3) 0.1 (1.9) ALP 9.5 (18.) 0.3 (0.4) LSPI 1.2 (1.5) 0.9 (0.1) API 0.04 (0.01) 0.08 (0.08) Table 1: Bellman residual of the final value function. The values are averages over 5 executions, with the standard deviations shown in parentheses. them. This issue can be addressed in at least two ways. First, a small randomly-selected subset of the constraints can be used in the ABP, a common approach in ALP [9, 5]. The ALP sampling bounds can be easily extended to ABP. Second, the structure of the MDP can be used to reduce the number of constraints. Such a reduction is possible, for example, in factored MDPs with L∞ -API and ALP [10], and can be easily extended to OAPI and ABP. In the mountain-car benchmark, an underpowered car needs to climb a hill [23]. To do so, it first needs to back up to an opposite hill to gain sufficient momentum. The car receives a reward of 1 when it climbs the hill. In the experiments we used a discount factor γ = 0.99. The experiments are designed to determine whether OAPI reliably minimizes the Bellman residual in comparison with API and ALP. We use a uniformly-spaced linear spline to approximate the value function. The constraints were based on 200 uniformly sampled states with all 3 actions per state. We evaluated the methods with the number of the approximation features 100 and 144, which corresponds to the number of linear segments. The results of ABP (in particular OAPI), ALP, API with L2 minimization, and LSPI are depicted in Table 1. The results are shown for both L∞ norm and uniformly-weighted L2 norm. The runtimes of all these methods are comparable, with ALP being the fastest. Since API (LSPI) is not guaranteed to converge, we ran it for at most 20 iterations, which was an upper bound on the number of iterations of OAPI. The results demonstrate that ABP minimizes the L∞ Bellman residual much more consistently than the other methods. Note, however, that all the considered algorithms would perform significantly better given a finer approximation. 7 Conclusion and Future Work We proposed and analyzed approximate bilinear programming, a new value-function approximation method, which provably minimizes the L∞ Bellman residual. ABP returns the optimal approximate value function with respect to the Bellman residual bounds, despite the formulation with regard to transitive-feasible value functions. We also showed that there is no asymptotically simpler formulation, since finding the closest value function and solving a bilinear program are both NP-complete problems. Finally, the formulation leads to the development of OAPI, a new convergent form of API which monotonically improves the objective value function. While we only discussed approximate solutions of the ABP, a deeper study of bilinear solvers may render optimal solution methods feasible. ABPs have a small number of essential variables (that determine the value function) and a large number of constraints, which can be leveraged by the solvers [15]. The L∞ error bound provides good theoretical guarantees, but it may be too conservative in practice. A similar formulation based on L2 norm minimization may be more practical. We believe that the proposed formulation will help to deepen the understanding of value function approximation and the characteristics of existing solution methods, and potentially lead to the development of more robust and widely-applicable reinforcement learning algorithms. Acknowledgements This work was supported by the Air Force Office of Scientific Research under Grant No. FA955008-1-0171. We also thank the anonymous reviewers for their useful comments. 8 References [1] Pieter Abbeel, Varun Ganapathi, and Andrew Y. Ng. Learning vehicular dynamics, with application to modeling helicopters. In Advances in Neural Information Processing Systems, pages 1–8, 2006. [2] Daniel Adelman. A price-directed approach to stochastic inventory/routing. Operations Research, 52:499–514, 2004. [3] Kristin P. Bennett and O. L. Mangasarian. Bilinear separation of two sets in n-space. Technical report, Computer Science Department, University of Wisconsin, 1992. [4] Dimitri P. Bertsekas and Sergey Ioffe. Temporal differences-based policy iteration and applications in neuro-dynamic programming. Technical Report LIDS-P-2349, LIDS, 1997. [5] Guiuseppe Calafiore and M.C. Campi. Uncertain convex programs: Randomized solutions and confidence levels. Mathematical Programming, Series A, 102:25–46, 2005. [6] Alberto Carpara and Michele Monaci. Bidimensional packing by bilinear programming. Mathematical Programming Series A, 118:75–108, 2009. [7] Daniela P. de Farias. The Linear Programming Approach to Approximate Dynamic Programming: Theory and Application. PhD thesis, Stanford University, 2002. [8] Daniela P. de Farias and Ben Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51:850–856, 2003. [9] Daniela Pucci de Farias and Benjamin Van Roy. On constraint sampling in the linear programming approach to approximate dynamic programming. Mathematics of Operations Research, 29(3):462–478, 2004. [10] Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman. Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research, 19:399–468, 2003. [11] Reiner Horst and Hoang Tuy. Global optimization: Deterministic approaches. Springer, 1996. [12] Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003. [13] O. L. Mangasarian. The linear complementarity problem as a separable bilinear program. Journal of Global Optimization, 12:1–7, 1995. [14] Remi Munos. Error bounds for approximate policy iteration. In International Conference on Machine Learning, pages 560–567, 2003. [15] Marek Petrik and Shlomo Zilberstein. Anytime coordination using separable bilinear programs. In Conference on Artificial Intelligence, pages 750–755, 2007. [16] Marek Petrik and Shlomo Zilberstein. Average reward decentralized Markov decision processes. In International Joint Conference on Artificial Intelligence, pages 1997–2002, 2007. [17] Marek Petrik and Shlomo Zilberstein. A bilinear programming approach for multiagent planning. Journal of Artificial Intelligence Research, 35:235–274, 2009. [18] Marek Petrik and Shlomo Zilberstein. Constraint relaxation in approximate linear programs. In International Conference on Machine Learning, pages 809–816, 2009. [19] Marek Petrik and Shlomo Zilberstein. Robust value function approximation using bilinear programming. Technical Report UM-CS-2009-052, Department of Computer Science, University of Massachusetts Amherst, 2009. [20] Warren B. Powell. Approximate Dynamic Programming. Wiley-Interscience, 2007. [21] Martin L. Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, Inc., 2005. [22] Kenneth O. Stanley and Risto Miikkulainen. Competitive coevolution through evolutionary complexification. Journal of Artificial Intelligence Research, 21:63–100, 2004. [23] Richard S. Sutton and Andrew Barto. Reinforcement learning. MIT Press, 1998. [24] Istvan Szita and Andras Lorincz. Learning Tetris using the noisy cross-entropy method. Neural Computation, 18(12):2936–2941, 2006. [25] Ronald J. Williams and Leemon C. Baird. Tight performance bounds on greedy policies based on imperfect value functions. In Yale Workshop on Adaptive and Learning Systems, 1994. 9
3 0.45656148 113 nips-2009-Improving Existing Fault Recovery Policies
Author: Guy Shani, Christopher Meek
Abstract: An automated recovery system is a key component in a large data center. Such a system typically employs a hand-made controller created by an expert. While such controllers capture many important aspects of the recovery process, they are often not systematically optimized to reduce costs such as server downtime. In this paper we describe a passive policy learning approach for improving existing recovery policies without exploration. We explain how to use data gathered from the interactions of the hand-made controller with the system, to create an improved controller. We suggest learning an indefinite horizon Partially Observable Markov Decision Process, a model for decision making under uncertainty, and solve it using a point-based algorithm. We describe the complete process, starting with data gathering, model learning, model checking procedures, and computing a policy. 1
4 0.39514497 12 nips-2009-A Generalized Natural Actor-Critic Algorithm
Author: Tetsuro Morimura, Eiji Uchibe, Junichiro Yoshimoto, Kenji Doya
Abstract: Policy gradient Reinforcement Learning (RL) algorithms have received substantial attention, seeking stochastic policies that maximize the average (or discounted cumulative) reward. In addition, extensions based on the concept of the Natural Gradient (NG) show promising learning efficiency because these regard metrics for the task. Though there are two candidate metrics, Kakade’s Fisher Information Matrix (FIM) for the policy (action) distribution and Morimura’s FIM for the stateaction joint distribution, but all RL algorithms with NG have followed Kakade’s approach. In this paper, we describe a generalized Natural Gradient (gNG) that linearly interpolates the two FIMs and propose an efficient implementation for the gNG learning based on a theory of the estimating function, the generalized Natural Actor-Critic (gNAC) algorithm. The gNAC algorithm involves a near optimal auxiliary function to reduce the variance of the gNG estimates. Interestingly, the gNAC can be regarded as a natural extension of the current state-of-the-art NAC algorithm [1], as long as the interpolating parameter is appropriately selected. Numerical experiments showed that the proposed gNAC algorithm can estimate gNG efficiently and outperformed the NAC algorithm.
5 0.36433411 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation
Author: Shalabh Bhatnagar, Doina Precup, David Silver, Richard S. Sutton, Hamid R. Maei, Csaba Szepesvári
Abstract: We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD(λ), Q-learning and Sarsa have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al. (2009a, 2009b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman error, and algorithms that perform stochastic gradient-descent on this function. These methods can be viewed as natural generalizations to previous TD methods, as they converge to the same limit points when used with linear function approximation methods. We generalize this work to nonlinear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms, for any finite Markov decision process and any smooth value function approximator, to a locally optimal solution. The algorithms are incremental and the computational complexity per time step scales linearly with the number of parameters of the approximator. Empirical results obtained in the game of Go demonstrate the algorithms’ effectiveness. 1
6 0.34399495 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability
7 0.3243179 221 nips-2009-Solving Stochastic Games
8 0.32166451 46 nips-2009-Bilinear classifiers for visual recognition
9 0.31496745 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control
10 0.30692059 11 nips-2009-A General Projection Property for Distribution Families
11 0.2712554 7 nips-2009-A Data-Driven Approach to Modeling Choice
12 0.23743552 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
13 0.23390603 54 nips-2009-Compositionality of optimal control laws
14 0.23123364 31 nips-2009-An LP View of the M-best MAP problem
15 0.22941373 197 nips-2009-Randomized Pruning: Efficiently Calculating Expectations in Large Dynamic Programs
16 0.22564328 233 nips-2009-Streaming Pointwise Mutual Information
17 0.22253831 234 nips-2009-Streaming k-means approximation
18 0.22218522 138 nips-2009-Learning with Compressible Priors
19 0.216295 180 nips-2009-On the Convergence of the Concave-Convex Procedure
20 0.20304023 78 nips-2009-Efficient Moments-based Permutation Tests
topicId topicWeight
[(21, 0.018), (24, 0.068), (25, 0.053), (35, 0.065), (36, 0.079), (39, 0.037), (58, 0.084), (59, 0.302), (61, 0.039), (71, 0.053), (81, 0.016), (86, 0.071)]
simIndex simValue paperId paperTitle
same-paper 1 0.73653716 16 nips-2009-A Smoothed Approximate Linear Program
Author: Vijay Desai, Vivek Farias, Ciamac C. Moallemi
Abstract: We present a novel linear program for the approximation of the dynamic programming cost-to-go function in high-dimensional stochastic control problems. LP approaches to approximate DP naturally restrict attention to approximations that are lower bounds to the optimal cost-to-go function. Our program – the ‘smoothed approximate linear program’ – relaxes this restriction in an appropriate fashion while remaining computationally tractable. Doing so appears to have several advantages: First, we demonstrate superior bounds on the quality of approximation to the optimal cost-to-go function afforded by our approach. Second, experiments with our approach on a challenging problem (the game of Tetris) show that the approach outperforms the existing LP approach (which has previously been shown to be competitive with several ADP algorithms) by an order of magnitude. 1
2 0.58045113 129 nips-2009-Learning a Small Mixture of Trees
Author: M. P. Kumar, Daphne Koller
Abstract: The problem of approximating a given probability distribution using a simpler distribution plays an important role in several areas of machine learning, for example variational inference and classification. Within this context, we consider the task of learning a mixture of tree distributions. Although mixtures of trees can be learned by minimizing the KL-divergence using an EM algorithm, its success depends heavily on the initialization. We propose an efficient strategy for obtaining a good initial set of trees that attempts to cover the entire observed distribution by minimizing the α-divergence with α = ∞. We formulate the problem using the fractional covering framework and present a convergent sequential algorithm that only relies on solving a convex program at each iteration. Compared to previous methods, our approach results in a significantly smaller mixture of trees that provides similar or better accuracies. We demonstrate the usefulness of our approach by learning pictorial structures for face recognition.
3 0.51616979 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
Author: Khashayar Rohanimanesh, Sameer Singh, Andrew McCallum, Michael J. Black
Abstract: Large, relational factor graphs with structure defined by first-order logic or other languages give rise to notoriously difficult inference problems. Because unrolling the structure necessary to represent distributions over all hypotheses has exponential blow-up, solutions are often derived from MCMC. However, because of limitations in the design and parameterization of the jump function, these samplingbased methods suffer from local minima—the system must transition through lower-scoring configurations before arriving at a better MAP solution. This paper presents a new method of explicitly selecting fruitful downward jumps by leveraging reinforcement learning (RL). Rather than setting parameters to maximize the likelihood of the training data, parameters of the factor graph are treated as a log-linear function approximator and learned with methods of temporal difference (TD); MAP inference is performed by executing the resulting policy on held out test data. Our method allows efficient gradient updates since only factors in the neighborhood of variables affected by an action need to be computed—we bypass the need to compute marginals entirely. Our method yields dramatic empirical success, producing new state-of-the-art results on a complex joint model of ontology alignment, with a 48% reduction in error over state-of-the-art in that domain. 1
4 0.50804579 3 nips-2009-AUC optimization and the two-sample problem
Author: Nicolas Vayatis, Marine Depecker, Stéphan J. Clémençcon
Abstract: The purpose of the paper is to explore the connection between multivariate homogeneity tests and AUC optimization. The latter problem has recently received much attention in the statistical learning literature. From the elementary observation that, in the two-sample problem setup, the null assumption corresponds to the situation where the area under the optimal ROC curve is equal to 1/2, we propose a two-stage testing method based on data splitting. A nearly optimal scoring function in the AUC sense is first learnt from one of the two half-samples. Data from the remaining half-sample are then projected onto the real line and eventually ranked according to the scoring function computed at the first stage. The last step amounts to performing a standard Mann-Whitney Wilcoxon test in the onedimensional framework. We show that the learning step of the procedure does not affect the consistency of the test as well as its properties in terms of power, provided the ranking produced is accurate enough in the AUC sense. The results of a numerical experiment are eventually displayed in order to show the efficiency of the method. 1
5 0.50778979 113 nips-2009-Improving Existing Fault Recovery Policies
Author: Guy Shani, Christopher Meek
Abstract: An automated recovery system is a key component in a large data center. Such a system typically employs a hand-made controller created by an expert. While such controllers capture many important aspects of the recovery process, they are often not systematically optimized to reduce costs such as server downtime. In this paper we describe a passive policy learning approach for improving existing recovery policies without exploration. We explain how to use data gathered from the interactions of the hand-made controller with the system, to create an improved controller. We suggest learning an indefinite horizon Partially Observable Markov Decision Process, a model for decision making under uncertainty, and solve it using a point-based algorithm. We describe the complete process, starting with data gathering, model learning, model checking procedures, and computing a policy. 1
6 0.50383306 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior
7 0.50354779 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability
8 0.50203836 100 nips-2009-Gaussian process regression with Student-t likelihood
9 0.50145578 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models
10 0.50089484 71 nips-2009-Distribution-Calibrated Hierarchical Classification
11 0.5004003 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions
12 0.50016338 169 nips-2009-Nonlinear Learning using Local Coordinate Coding
13 0.49932629 97 nips-2009-Free energy score space
14 0.49919727 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data
15 0.49888194 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA
16 0.49795586 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization
17 0.49780318 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference
18 0.49683344 72 nips-2009-Distribution Matching for Transduction
19 0.49624979 260 nips-2009-Zero-shot Learning with Semantic Output Codes
20 0.49623325 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models